Course: Compiler Design and Construction (CSC 352)


Introduction to Compiler 


What are compilers 

• To most people, a compiler is a "black box program" which takes source code written 
in some high-level language, such as FORTRAN, BASIC, Pascal, or C, and translates 
(compiles) it into a language the computer can understand and execute. Compilers 
take source code and transform it into an equivalent code in a target language. 
Cousins of the Compiler

Preprocessors: Preprocessors are programs that process the source program before compilation,
for example to expand macro definitions.

Loaders and Linkers: If the target program is machine code, a loader places the target code
into memory for execution. A linker combines the target program with the libraries it uses.

Interpreter: An interpreter performs translation, loading and execution in lock step.

JIT Compilers: Just-in-time (JIT) compilers compile code at run time, immediately before it is
loaded and executed. They represent a hybrid approach: translation occurs continuously, as with
interpreters, but the translated code is cached to minimize performance degradation.
Compilers vs. Interpreters

• Computers cannot understand English words and grammar. Even the highly
structured words and sentences of programming languages must be translated before
a computer can understand them.

• The compiler or interpreter must look up each "word" of our programming language in 
a kind of dictionary (or lexicon) and, in a series of steps, translate it into machine 
code. Each word initiates a separate logical task. 

• An interpreter translates one line of source code at a time into machine code, and then
executes it. Debugging and testing are relatively fast and easy in interpreted
languages, since the entire program doesn't have to be reprocessed each time a
change is made. Classic BASIC is a typical example.

• Interpreted programs usually run much slower than compiled programs, because they must
be translated each time they are run. Programmers often test and debug their
programs using an interpreter and then compile them for production use.

How Compilers Work 

Most compilers convert programs in three steps. Each step is called a pass. A particular 
compiler may have one program per pass, or may combine two or three steps in a 
single program. For a very complex language, a step may be so difficult to perform that it is 
broken up into many smaller steps. Regardless of how many passes or programs are 
required, the compiler performs only three main functions: 

1. Lexical analysis: The stream of characters forming the source program is scanned
linearly to produce a stream of logical elements called tokens.

2. Syntax analysis: The stream of tokens is grouped into hierarchical collections called
syntax groupings.

3. Code generation: Code is generated in the target language.

During each pass of the compiler, the source code moves closer to becoming virtual 
machine language (or whatever language the compiler is designed to generate). 

Phases of Compilers 

For the compilation of any source program written in some source language, the compiler passes
through two phases: analysis and synthesis.

Analysis

Analyze the source program and build an intermediate representation.

Synthesis

Generate the target program from the intermediate representation.

Analysis of the source program includes lexical analysis, syntax analysis and semantic analysis.

Lexical Analysis 

In the first pass of the compiler, the source code is passed through a lexical analyzer,
which converts the source code to a stream of tokens. A token generally represents some
element of the language, such as a keyword. A compiler has a unique number for each keyword
(e.g. IF, WHILE, END) and each arithmetic or logical operator (e.g. +, -, *, AND, OR). Numbers
are represented by a token which indicates that what follows it should be
interpreted as a number.

The other important task of the lexical analyzer is to build a symbol table. This is a 
table of all the identifiers (variable names, procedures, and constants) used in the 
program. When an identifier is first recognized by the analyzer, it is inserted into the 
symbol table, along with information about its type, where it is to be stored, and so forth. This 
information is used in subsequent passes of the compiler. E.g. 

dist = 0.5 * x * sqr(t)                              input character stream

dist    =     0.5    *     x     *     sqr(t)
 id     op    num    op    id    op    fncall        output token stream

id: identifier, op: operator, num: number, fncall: function call
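
The conversion above can be sketched with regular expressions. This is only a minimal sketch: the token class names follow the figure, but the patterns, the table TOKEN_SPEC and the function tokenize are assumptions made for illustration. In particular, a call such as sqr(t) is matched as one fncall token because the figure shows it that way; a production lexer would normally emit the name and the parentheses separately.

```python
import re

# Hypothetical token patterns; order matters because the first alternative that matches wins.
TOKEN_SPEC = [
    ("fncall", r"[A-Za-z_]\w*\([^()]*\)"),  # e.g. sqr(t), kept as a single token as in the figure
    ("num",    r"\d+(?:\.\d+)?"),           # e.g. 0.5
    ("id",     r"[A-Za-z_]\w*"),            # e.g. dist, x
    ("op",     r"[=+\-*/]"),                # e.g. =, *
    ("skip",   r"\s+"),                     # whitespace is discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token_type, lexeme) pairs for the input character stream."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"illegal character {text[pos]!r} at position {pos}")
        if match.lastgroup != "skip":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for token_type, lexeme in tokenize("dist = 0.5 * x * sqr(t)"):
    print(f"{token_type:7} {lexeme}")
```

Listing fncall before id is a deliberate choice here: otherwise sqr would be matched as an identifier and the parenthesis would be reported as an illegal character.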
Syntax Analysis 

After the lexical analyzer translates a program into tokens of keywords, variables, constants, 
symbols and logical operators, the compiler makes its next pass. To describe what happens 
during this function, recall the concept of grammars. 

Grammars. Like any language, programming languages have a set of rules governing the 
structure of the program. Each different computer language has its own grammar which makes 
it unique. Some grammars are complex and others are relatively easy. The programmer must 
observe all the structural rules of a language to make logical sense to the computer. The next
step of the compiling process, parsing, checks that all the rules were followed.

Parsing. The parsing routines of a compiler check to see that the program is written 
correctly (according to the language rules described by grammars). The parser reads in the 
tokens generated by the lexical analyzer and compares them to the set grammar of the 
programming language. If the program follows the rules of the language, then it is 
syntactically correct. When the parser encounters a mistake, it issues a warning or error 
message and tries to continue. 


Prep By: HGC 






statement
    assignment
        id
        expr
            num
            *
            expr
                id
                *
                fncall

Example of parse tree construction from token stream


Some parsers try to correct a faulty program, others do not. When the parser reaches the 
end of the token stream, it will tell the compiler that either the program is grammatically 
correct and compiling can continue or the program contains too many errors and compiling 
must be aborted. If the program is grammatically correct, the parser will call for semantic 
routines. 

Semantic Analysis 

The semantic routines of a compiler perform two tasks: checking to make sure that each 
series of tokens will be understood by the computer when it is fully translated to machine 
code, and converting the series of tokens one step closer to machine code. 

• The first task takes a series of tokens, called a production, and checks it to see if it 
makes sense. For example, a production may be correct as far as the parser is 
concerned, but the semantic routines check whether the variables have been declared, 
and are of the right type, etc. 

• If the production makes sense, the semantic routine reduces the production for the 
next phase of compilation, code generation. Most of the code for the compiler lies 
here in the semantic routines and thus takes up a majority of the compilation time. 

Thus, two major routines comprise syntax analysis: the parsing routine and the semantic 
routine. The parser checks for the correct order of the tokens and then calls the semantic 
routines to check whether the series of tokens (a production) will make sense to the 
computer. The semantic routine then reduces the production another step toward complete 
translation to machine code. 
Synthesis: This phase concerns generating code in the target language. It usually
consists of the following phases.

• Intermediate code generation

• Code optimization

• Final code generation

An intermediate language is used by many compilers for analyzing and optimizing the
source program. An intermediate language should have two important properties:

It should be simple and easy to produce, and
it should be easy to translate into the target program.

Example:

temp1 = int2float(sqr(t));
temp2 = x * temp1;
temp3 = 0.5 * temp2;
dist = temp3;
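
How such intermediate code can be produced from an expression tree may be sketched as follows. The tuple-based tree, the helper names gen_tac and compile_assignment, and the temp naming scheme are assumptions made for this illustration; the type conversion int2float is left out to keep the sketch short. The tree is nested to the right, as in the parse tree shown earlier in these notes.

```python
import itertools


def gen_tac(node, code, counter):
    """Append three-address instructions for node to code; return the name holding its value.

    A node is either a string (a variable or a literal), a tuple ("call", function, argument),
    or a tuple (operator, left, right).
    """
    if isinstance(node, str):
        return node  # variables and literals need no instruction
    if node[0] == "call":
        _, function, argument = node
        value = f"{function}({gen_tac(argument, code, counter)})"
    else:
        operator, left, right = node
        left_name = gen_tac(left, code, counter)
        right_name = gen_tac(right, code, counter)
        value = f"{left_name} {operator} {right_name}"
    temp = f"temp{next(counter)}"  # every intermediate result gets a fresh temporary
    code.append(f"{temp} = {value}")
    return temp


def compile_assignment(target, expression):
    """Return the list of three-address instructions for target = expression."""
    code = []
    result = gen_tac(expression, code, itertools.count(1))
    code.append(f"{target} = {result}")
    return code


# dist = 0.5 * (x * sqr(t))
tree = ("*", "0.5", ("*", "x", ("call", "sqr", "t")))
print("\n".join(compile_assignment("dist", tree)))
```

Each operator node yields exactly one instruction with at most one operator on its right-hand side, which is what makes this form simple to produce and simple to translate further.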

The code generation process determines how fast the code will run and how large it will be. 
The first part of code generation involves optimization, and the second involves final target 
code generation. 


Code Optimization. In this step, the compiler tries to make the intermediate code generated by 
the semantic routines more efficient. This process can be very slow and may not be able to 
improve the code much. Because of this, many compilers don't include optimizers, and, if 
they do, they look only for areas that are easy to optimize. 

- Detection of redundant function calls

- Detection of loop invariants

- Common subexpression elimination

- Dead-code detection and elimination

Final Code Generation. This process takes the intermediate code produced by the optimizer
(or by the semantic routines if the compiler has no optimizer) and generates final code in the
target language. It is this part of compilation that is machine dependent. Each type of
computer has its own instruction set and operating system conventions; therefore,
the code generator must be different for each type of computer.

If the program is free from syntactic and semantic errors, code generation should take place
without any problem. When the code generator is finished, the code produced will be in the
target language.

For a C compiler this is typically an object file (for example an .OBJ file) that is ready to
go to a linker, which creates an executable file (*.EXE, *.COM, etc.) from the machine code the
compiler has generated.
Summary: Phases of a Compiler

[Figure: the phases of a compiler, ending in the target program]
The Symbol Table: A symbol table stores information about the keywords and tokens found during
lexical analysis. The symbol table is consulted in almost all phases of the compiler.

For example:

insert("dist", id);  // inserts a symbol table entry associating the string "dist" with token type id
lookup("dist");      // looks up an occurrence of the string "dist" in the symbol table; if found,
                     // a reference to its token type is returned, else lookup returns 0
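
A minimal sketch of these two operations, assuming a hash table (a Python dict) as the underlying storage. The class name SymbolTable is hypothetical, and the behaviour for a repeated insert (the first entry is kept) is an assumption, since the notes do not specify it.

```python
class SymbolTable:
    """Maps lexemes to token types; lookup returns 0 for unknown lexemes, as described above."""

    def __init__(self):
        self._entries = {}

    def insert(self, lexeme, token_type):
        # Assumption: inserting an existing lexeme keeps the first entry.
        return self._entries.setdefault(lexeme, token_type)

    def lookup(self, lexeme):
        return self._entries.get(lexeme, 0)


table = SymbolTable()
table.insert("dist", "id")
print(table.lookup("dist"))   # the token type stored for "dist"
print(table.lookup("speed"))  # 0, because "speed" was never inserted
```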


Error handling: 

- Errors may be encountered in any phase of the compiler.

- The objective of error handling is to go as far as possible in the compilation whenever an
error is encountered.

For example:

- Handling missing symbols during lexical analysis by inserting symbols.

- Automatic type conversion during semantic analysis.
Lexical Analysis

This is the initial part of reading and analyzing the program text. The text is read and
divided into tokens, each of which corresponds to a symbol in the programming language, e.g. a
variable name, keyword or number. A lexical analyzer, or lexer, takes a string of individual
characters as its input and divides this string into tokens. It also discards comments and
whitespace (i.e. blanks and newlines).

Overview of Lexical Analysis 

A lexical analyzer, also called a scanner, typically has the following functionality and 
characteristics. 

• Its primary function is to convert from a sequence of characters into a sequence of tokens. 
This means less work for subsequent phases of the compiler. 

• The scanner must identify and categorize specific character sequences as tokens. It
must know whether two adjacent characters in the file belong together in the same
token, or whether the second character belongs to a different token.

• Most lexical analyzers discard comments and whitespace. In most languages these
characters serve to separate tokens from each other.

• It handles lexical errors (illegal characters, malformed tokens) by reporting them
intelligibly to the user.

• Efficiency is crucial; a scanner may perform elaborate input buffering.

• Token categories can be specified precisely and formally using regular expressions, e.g.

  IDENTIFIER = [a-zA-Z][a-zA-Z0-9]*

• Lexical analyzers can be written by hand, or generated automatically from such
specifications using finite automata.
Role of Lexical Analyzer

Figure:- Interaction of lexical analyzer with parser

The lexical analyzer works in lock step with the parser. The parser requests the next token
from the lexical analyzer whenever it requires one, using getnexttoken(). The lexical analyzer
may also perform auxiliary operations such as removing redundant white space, removing token
separators (like the semicolon ;), removing comments, and providing line numbers to the parser
for error reporting. The function of a lexical analyzer is to read the input stream
representing the source program, one character at a time, and to translate it into valid tokens.

Issues in Lexical Analysis 

There are several reasons for separating the analysis phase of compiling into lexical analysis and 
parsing. 

1) Simpler design is the most important consideration. The separation of lexical analysis from 
syntax analysis often allows us to simplify one or the other of these phases. 

2) Compiler efficiency is improved. 

3) Compiler portability is enhanced. 

Tokens, Patterns, Lexemes 

In a compiler, a token is a single word of source code input. Tokens are separately
identifiable units with a collective meaning. When a string representing a program is broken
into a sequence of substrings, such that each substring represents a constant, identifier,
operator, keyword etc. of the language, these substrings are called the tokens of the language.
They are the building blocks of the programming language, e.g. if, else, identifiers.

Lexemes are the actual strings matched as tokens. They are the specific characters that
make up a token, for example abc and 123. A token can represent more than one lexeme, e.g. the
token intnum can represent the lexemes 123, 244, 4545 etc.

Patterns are rules describing the set of lexemes belonging to a token. A pattern is usually a
regular expression, e.g. the intnum token can be defined as [0-9][0-9]*.

Attribute of tokens 

When a token represents more than one lexeme, the lexical analyzer must provide additional
information about the particular lexeme that was matched. This additional information is called
the attribute of the token. For example, the token id matches both var1 and var2; in this case
the lexical analyzer must be able to represent var1 and var2 as different identifiers.

Example: take the statement area = 3.1416 * r * r

1. getnexttoken() returns (id, attr), where attr is a pointer to the symbol table entry for area

2. getnexttoken() returns (assign); no attribute is needed if there is only one
assignment operator

3. getnexttoken() returns (floatnum, 3.1416), where 3.1416 is the actual value of
floatnum

... and so on.

The token type and its attribute together uniquely identify a lexeme.
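
The token/attribute pairs of this example can be sketched as below. The sketch assumes that the attribute of an id is its index in a list used as the symbol table (standing in for a pointer), and that the multiplication operator is called mulop; both names and the function tokens_with_attributes are hypothetical.

```python
import re

# One alternative per token type; leading whitespace is skipped before every token.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<floatnum>\d+\.\d+)|(?P<id>[A-Za-z_]\w*)|(?P<assign>=)|(?P<mulop>\*))"
)


def tokens_with_attributes(text):
    """Return (tokens, symbols): tokens are (type, attribute) pairs, symbols is the symbol table."""
    symbols = []  # symbol table; the index of a name is the attribute of its id token
    tokens = []
    text = text.strip()
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"no pattern matches at position {pos}")
        kind = match.lastgroup
        lexeme = match.group(kind)
        if kind == "id":
            if lexeme not in symbols:
                symbols.append(lexeme)
            tokens.append(("id", symbols.index(lexeme)))
        elif kind == "floatnum":
            tokens.append(("floatnum", float(lexeme)))  # the attribute is the numeric value
        else:
            tokens.append((kind, None))                 # operators need no attribute
        pos = match.end()
    return tokens, symbols


tokens, symbols = tokens_with_attributes("area =3.1416 * r * r")
print(tokens)
print(symbols)
```

Both occurrences of r receive the same attribute, which is how later phases can tell that they refer to the same identifier.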

Lexical Errors 

Though errors at the lexical analysis stage are not common, they are possible. When an error
occurs, the lexical analyzer must not halt the process; it can print an error message and
continue. An error is detected in this phase when no pattern matches the remaining input. Some
error recovery techniques are deleting an extraneous character, inserting a missing character,
replacing an incorrect character by a correct one, and transposing adjacent characters. Lexical
error recovery is normally an expensive process, e.g. finding the number of transformations
that would turn the input into correct tokens.
Implementing Lexical Analyzer (Approaches)

1. Use a lexical analyzer generator like flex, which produces a lexical analyzer from a given
specification written as regular expressions. We do not need to take care of reading and
buffering the input.

2. Write the lexical analyzer in a general-purpose programming language like C. We need to use
the I/O facilities of the language for reading and buffering the input.

3. Write the lexical analyzer in assembly language and explicitly manage the reading of
input.

The above strategies are in increasing order of difficulty, but also of potential efficiency.
Since lexical analysis deals with every character of the input, it is worth taking some
time to get an efficient result.

Look Ahead and Buffering 

Recognizing a token often requires looking ahead several characters beyond the matched lexeme
before the token can be returned. For example, int is a keyword in C but integer is an
identifier, so when the scanner has read i, n, t it has to look at the following characters to
see whether the lexeme is just int or some longer word. The characters that were read but are
not part of the lexeme must then be rescanned next time, which is time consuming. To reduce this
overhead, and to move back and forth over the input efficiently, an input buffering technique
is used.


Input Buffering 

We will consider lookahead with 2N buffering and the use of sentinels.

Figure:- An input buffer in two halves

We divide the buffer into two halves of N characters each; normally N is the number of
characters in one disk block, like 1024 or 4096. Rather than reading character by character
from the file, we read N input characters at once. If fewer than N characters remain in the
input, an eof marker is placed after them. There are two pointers (see the figure above). The
portion between the lexeme pointer and the forward pointer is the current lexeme. Once a match
for a pattern is found, both pointers are set to the same place, just past the lexeme, and the
forward pointer moves on. This method has limited lookahead, so it may not work in all cases,
for example for a multi-line comment in C. In this approach we must check, each time the forward
pointer is moved, whether the end of one half has been reached (two tests per move whenever the
forward pointer is not at the end of a half). The number of such tests can be reduced if we
place a sentinel character at the end of each half.


Figure:- Sentinels at end of each buffer half


if forward points to end of first half
    reload second half
    forward++
else if forward points to end of second half
    reload first half
    forward = start of first half
else
    forward++

Figure:- Code to advance forward pointer (first scheme)


forward++
if forward points to eof
    if forward points to end of first half
        reload second half
        forward++
    else if forward points to end of second half
        reload first half
        forward = start of first half
    else
        terminate lexical analysis

Figure:- Lookahead code with sentinels
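
The sentinel scheme can be simulated as follows. This is a sketch under stated assumptions: the sentinel is the NUL character (assumed never to occur in the source text), the class name TwoHalfBuffer and the tiny default N = 4 are chosen only for illustration, and the lexeme pointer is left out so that the forward pointer logic stays visible.

```python
import io

EOF = "\0"  # sentinel character; assumed never to appear in the source text


class TwoHalfBuffer:
    """Two halves of n characters, each followed by one sentinel slot.

    Layout: indexes 0..n-1 first half, n sentinel, n+1..2n second half, 2n+1 sentinel.
    """

    def __init__(self, stream, n=4):
        self.stream = stream
        self.n = n
        self.buf = [EOF] * (2 * n + 2)
        self._load(0)
        self.forward = -1  # advanced before the first character is read

    def _load(self, start):
        """Read up to n characters into the half beginning at start; pad with EOF."""
        chunk = self.stream.read(self.n)
        for i in range(self.n):
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        """Advance the forward pointer; return the next character, or None at end of input."""
        self.forward += 1
        ch = self.buf[self.forward]
        if ch == EOF:                               # the only test on the common path
            if self.forward == self.n:              # sentinel at end of first half
                self._load(self.n + 1)
                self.forward += 1
            elif self.forward == 2 * self.n + 1:    # sentinel at end of second half
                self._load(0)
                self.forward = 0
            else:                                   # eof inside a half: input is exhausted
                return None
            ch = self.buf[self.forward]
            if ch == EOF:
                return None
        return ch


def read_all(text, n=4):
    """Push text through the buffer and return everything that comes out."""
    buffer = TwoHalfBuffer(io.StringIO(text), n)
    out = []
    while (ch := buffer.next_char()) is not None:
        out.append(ch)
    return "".join(out)


print(read_all("float x, y, z;"))
```

For ordinary characters only the single comparison with the sentinel is executed; the tests for the ends of the halves run only when a sentinel is actually seen.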

Specifications of Tokens 

Regular expressions are an important notation for specifying patterns. Each pattern 
matches a set of strings, so regular expressions will serve as names for sets of strings. 

Some Definitions 

■ Alphabets

o An alphabet A is a set of symbols that generate languages. For example, {0-9} is an
alphabet that is used to produce all the non-negative integer numbers, and {0-1} is an
alphabet that is used to produce all the binary strings.

■ String

o A string is a finite sequence of characters from the alphabet A. Given the alphabet
A, A² = A.A is the set of strings of length 2; similarly Aⁿ is the set of strings of length n.
The length of a string w, denoted |w|, is the number of characters (symbols) in
w. We also have A⁰ = {ε}, where ε is called the empty string and |ε| = 0.

■ Kleene Closure

o The Kleene closure of an alphabet A, denoted by A*, is the set of all strings of any length
(including 0) possible from A. Mathematically, A* = A⁰ ∪ A¹ ∪ A² ∪ ... For any
string w over alphabet A, w ∈ A*.

■ A language L over alphabet A is a set such that L ⊆ A*.

■ The string s is called a prefix of w if s is obtained by removing zero or more
trailing characters from w. s is a proper prefix if s ≠ w.

■ The string s is called a suffix of w if s is obtained by deleting zero or more
leading characters from w. s is a proper suffix if s ≠ w.

■ The string s is called a substring of w if we can obtain s by deleting zero or more leading
and zero or more trailing characters from w. s is a proper substring if s ≠ w.

■ Regular Operators

o The following operators are called regular operators, and the languages formed with them
are called regular languages.

■ . → Concatenation operator, R.S = {rs | r ∈ R and s ∈ S}.

■ * → Kleene star operator, A* = ∪ (i = 0 to ∞) Aⁱ.

■ + / ∪ / | → Choice/union operator, R ∪ S = {t | t ∈ R or t ∈ S}.

Regular Expression (RE) 

We use regular expression to describe the tokens of a programming language. 

Basis Symbols

■ ε is a regular expression denoting the language {ε}

■ a ∈ A is a regular expression denoting {a}

If r and s are regular expressions denoting languages L1(r) and L2(s) respectively, then

■ r + s is a regular expression denoting L1(r) ∪ L2(s)

■ rs is a regular expression denoting L1(r) L2(s)

■ r* is a regular expression denoting (L1(r))*

■ (r) is a regular expression denoting L1(r)

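Examples

■ (1+0)*0 is an RE that gives the binary strings that are divisible by 2.

■ 0*10* + 0*110* is an RE that gives the binary strings containing exactly one 1 or exactly
two adjacent 1s. (Binary strings with at most two 1s are given by 0* + 0*10* + 0*10*10*.)

■ (1+0)*00 denotes the language of all strings that end with 00 (binary numbers that are
multiples of 4).

■ (01)* + (10)* denotes the set of strings of alternating 0s and 1s built from repeated 01
pairs or repeated 10 pairs (including the empty string).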
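
Claims about what a regular expression matches can be checked by brute force over all short binary strings. The sketch below uses Python's re module, where union is written | instead of +; the helper names matches, binary_strings and agrees are hypothetical. The predicate used for 0*10*|0*110* is the one the check confirms for that expression: exactly one 1, or exactly two adjacent 1s.

```python
import re
from itertools import product


def matches(pattern, s):
    """True if the whole string s is in the language of pattern."""
    return re.fullmatch(pattern, s) is not None


def binary_strings(max_len):
    """Yield every binary string of length 0..max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)


def agrees(pattern, predicate, max_len=8):
    """True if pattern and predicate accept exactly the same strings up to max_len."""
    return all(matches(pattern, s) == predicate(s) for s in binary_strings(max_len))


print(agrees(r"(1|0)*0", lambda s: s != "" and int(s, 2) % 2 == 0))
print(agrees(r"(1|0)*00", lambda s: s.endswith("00")))
print(agrees(r"0*10*|0*110*",
             lambda s: s.count("1") == 1 or (s.count("1") == 2 and "11" in s)))
print(agrees(r"(01)*|(10)*",
             lambda s: len(s) % 2 == 0 and s in ("01" * (len(s) // 2), "10" * (len(s) // 2))))
```

An exhaustive check up to a fixed length is not a proof, but it catches a wrong description of a language very quickly.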

Properties of RE 

■ r+s = s+r (+ is commutative)

■ r+(s+t) = (r+s)+t; r(st) = (rs)t (+ and . are associative)

■ r(s+t) = (rs)+(rt); (r+s)t = (rt)+(st) (. distributes over +)

■ εr = rε = r (ε is the identity element for concatenation)

■ r* = (r+ε)* (relation between * and ε)

■ r** = r* (* is idempotent)
Regular Definitions

Writing a regular expression for some languages can be difficult, because the expression
can become quite complex. In those cases we may use regular definitions: we give a name to an
RE and reuse that name as a basic symbol when writing further REs.

Example

■ d1 → r1, d2 → r2, ..., dn → rn, where each di is a distinct name and each ri is an RE over
the alphabet A ∪ {d1, d2, ..., di-1}

■ In C the RE for identifiers can be written using regular definitions as

o letter → a + b + ... + z + A + B + ... + Z

o digit → 0 + 1 + ... + 9
o identifier → (letter + _)(letter + digit + _)*

[Note: Remember that a recursive regular definition may not produce an RE; that means

digits → digits digits | digits is wrong!]

Recognition of Tokens 

A recognizer for a language is a program that takes a string w and answers “YES” if w is
a sentence of that language, otherwise “NO”. The tokens that are specified using REs are
recognized by using transition diagrams or finite automata (FA). Starting from the start state,
we follow the transitions defined. If the transitions lead to an accepting state, then the
token is matched and the lexeme is returned; otherwise other transition diagrams are tried
until all the transition diagrams have been processed or a failure is detected. A recognizer of
tokens takes the language L and the string s as input and tries to verify whether s ∈ L or not.
There are two types of finite automata.

1. Deterministic Finite Automaton (DFA) 

2. Non Deterministic Finite Automaton (NFA) 

Deterministic Finite Automaton (DFA) 

An FA is deterministic if there is exactly one transition for each (state, input) pair. It is
a faster recognizer, but it may take more space. A DFA is a five-tuple (S, A, S0, δ, F) where,

S → finite set of states
A → finite set of input symbols (the alphabet)
S0 → starting state
δ → transition function, i.e. δ: S x A → S
F → set of final states, F ⊆ S

Implementing DFA 

The following is the algorithm for simulating a DFA to recognize a given string. For a
given string w and a DFA D with start state q0, the output is “YES” if D accepts w, otherwise
“NO”.

DFASim(D, q0) {
    q = q0;
    c = getchar();
    while (c != eof)
    {
        q = move(q, c);   // this is the δ function
        c = getchar();
    }
    if (q is in F)
        return “YES”;     // D accepts w
    else
        return “NO”;
}

Figure:- Simulating DFA 
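
A runnable sketch of the same simulation. The transition function δ is assumed to be stored as a dictionary keyed by (state, symbol), and a missing entry is treated as a dead state; the function name dfa_accepts, the state names q0 and q1 and the example DFA for (1+0)*0 are assumptions made for this illustration.

```python
def dfa_accepts(transitions, start, finals, text):
    """Simulate a DFA on text; return True if it ends in a final state."""
    state = start
    for ch in text:
        state = transitions.get((state, ch))  # the move (δ) function
        if state is None:                     # no transition defined: reject
            return False
    return state in finals


# Hypothetical DFA for (1+0)*0: q1 means "the last symbol read was 0".
ENDS_WITH_ZERO = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q1", ("q1", "1"): "q0",
}

for w in ["1010", "101", ""]:
    print(repr(w), "YES" if dfa_accepts(ENDS_WITH_ZERO, "q0", {"q1"}, w) else "NO")
```

Each input character costs one dictionary lookup, which is why a DFA is the faster recognizer.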
Non Deterministic Finite Automaton (NFA)

An FA is non-deterministic if there may be more than one transition for a (state, input) pair.
It is a slower recognizer, but it may take less space. An NFA is a five-tuple (S, A, S0, δ, F)
where,

S → finite set of states
A → finite set of input symbols (the alphabet)
S0 → starting state
δ → transition function, i.e. δ: S x A → 2^S
F → set of final states, F ⊆ S


Examples

The above state machine is an NFA having a non-deterministic transition at state S1. On
reading 0 it may either stay in S1 or go to S3 (accept). Its regular expression is 1(1 + 0)*0.

The above figure shows a state machine with ε-moves that is equivalent to the regular expression
(10)* + (01)*. The upper half of the machine recognizes (10)* and the lower half recognizes
(01)*. The machine non-deterministically moves to the upper half or the lower half when
started from state s0.

Algorithm 

S = ε-closure({S0})   // set of all states that can be reached from S0 by ε-transitions
c = getchar()
while (c != eof)
{
    S = ε-closure(move(S, c))   // set of all states that can be reached from a state in S
                                // by a transition on c, followed by ε-transitions
    c = getchar()
}
if (S ∩ F ≠ ∅) then
    return “YES”
else
    return “NO”

Figure:- Simulating NFA
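
The simulation can be sketched as below for an ε-NFA of (10)* + (01)*. The state names (s0, u0, u1, l0, l1) and the exact transitions are assumptions made for this illustration, since only the regular expression and the upper/lower split are described in the text; ε is represented by the empty string.

```python
EPS = ""  # label used for ε-transitions


def eps_closure(nfa, states):
    """All states reachable from the given states using ε-transitions only."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for target in nfa.get((state, EPS), ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return closure


def move(nfa, states, symbol):
    """All states reachable from the given states by one transition on symbol."""
    return {target for state in states for target in nfa.get((state, symbol), ())}


def nfa_accepts(nfa, start, finals, text):
    """Track the set of possible current states; accept if it meets the final states."""
    current = eps_closure(nfa, {start})
    for ch in text:
        current = eps_closure(nfa, move(nfa, current, ch))
    return bool(current & finals)


# Hypothetical ε-NFA: the upper half (u0, u1) loops on "10", the lower half (l0, l1) on "01".
ALTERNATING = {
    ("s0", EPS): {"u0", "l0"},
    ("u0", "1"): {"u1"}, ("u1", "0"): {"u0"},
    ("l0", "0"): {"l1"}, ("l1", "1"): {"l0"},
}

for w in ["", "1010", "0101", "1001"]:
    print(repr(w), "YES" if nfa_accepts(ALTERNATING, "s0", {"u0", "l0"}, w) else "NO")
```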
Reducibility

1. NFA to DFA

2. RE to NFA

3. RE to DFA

NFA to DFA (subset construction)

The subset construction is an algorithm that constructs, from an NFA, a DFA that recognizes
the same language. There may be several accepting states in a given subset of nondeterministic
states; the accepting state corresponding to the pattern listed first in the lexical analyzer
generator specification has priority. As before, state transitions are made until a state is
reached which has no next state for the current input symbol. The last input position at
which the DFA entered an accepting state gives the lexeme. We need the following operations.

■ ε-closure(S) → the set of NFA states reachable from NFA state S on ε-transitions alone

■ ε-closure(T) → the set of NFA states reachable from some NFA state S in T on ε-transitions
alone

■ Move(T, a) → the set of NFA states to which there is a transition on input symbol a from
some NFA state S in T

Subset Construction Algorithm 

Put ε-closure(S0) as an unmarked state in DStates
While there is an unmarked state T in DStates do
    Mark T
    For each input symbol a ∈ A do
        U = ε-closure(move(T, a))
        If U is not in DStates then
            Add U as an unmarked state to DStates
        End if
        DTran[T, a] = U
    End do
End do

■ DStates is the set of states of the new DFA; each of them is a set of states of the NFA

■ DTran is the transition table of the new DFA

■ A state in DStates is an accepting state of the DFA if it is a set of NFA states containing
at least one accepting state of the NFA

■ The start state of the DFA is ε-closure(S0)



ε-closure(move(S0, a)) = ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = S1 → S1 into DStates as an
unmarked state
ε-closure(move(S0, b)) = ε-closure({5}) = {1, 2, 4, 5, 6, 7} = S2 → S2 into DStates as an
unmarked state
DTran[S0, a] ← S1
DTran[S0, b] ← S2

Mark S1
ε-closure(move(S1, a)) = ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = S1
ε-closure(move(S1, b)) = ε-closure({5}) = {1, 2, 4, 5, 6, 7} = S2
DTran[S1, a] ← S1
DTran[S1, b] ← S2

Mark S2
ε-closure(move(S2, a)) = ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = S1
ε-closure(move(S2, b)) = ε-closure({5}) = {1, 2, 4, 5, 6, 7} = S2
DTran[S2, a] ← S1
DTran[S2, b] ← S2

Bikash Balami 





S0 is the start state of the DFA, since 0 is a member of S0 = {0, 1, 2, 4, 7}

S1 is an accepting state of the DFA, since 8 is a member of S1 = {1, 2, 3, 4, 6, 7, 8}
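
The worked example can be reproduced with a short sketch of the subset construction. The NFA transitions below are not given explicitly in the text; they are inferred from the ε-closure and move sets of the example (an ε-NFA for (a + b)*a with states 0 to 8) and should be read as an assumption. The function names are hypothetical.

```python
EPS = ""  # label used for ε-transitions

# Assumed ε-NFA for (a + b)*a, consistent with the closure sets in the example above.
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4},
    (2, "a"): {3},    (4, "b"): {5},
    (3, EPS): {6},    (5, EPS): {6},
    (6, EPS): {1, 7}, (7, "a"): {8},
}


def eps_closure(states):
    """All NFA states reachable from the given states on ε-transitions alone."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for target in NFA.get((state, EPS), ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return closure


def move(states, symbol):
    """All NFA states reachable from the given states by one transition on symbol."""
    return {target for state in states for target in NFA.get((state, symbol), ())}


def subset_construction(start, alphabet):
    """Return (dstates, dtran): DFA states in discovery order and the transition table."""
    start_set = frozenset(eps_closure({start}))
    dstates = [start_set]
    unmarked = [start_set]
    dtran = {}
    while unmarked:
        current = unmarked.pop(0)            # mark the state
        for symbol in alphabet:
            target = frozenset(eps_closure(move(current, symbol)))
            if target not in dstates:
                dstates.append(target)
                unmarked.append(target)
            dtran[(current, symbol)] = target
    return dstates, dtran


dstates, dtran = subset_construction(0, "ab")
names = {state: f"S{i}" for i, state in enumerate(dstates)}
for (state, symbol), target in dtran.items():
    print(f"DTran[{names[state]}, {symbol}] = {names[target]}")
print("accepting:", [names[s] for s in dstates if 8 in s])
```

Frozen sets are used for the DFA states so that they can serve as dictionary keys in the transition table.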



RE to NFA (Thompson's Construction)

Input → RE r over alphabet A
Output → ε-NFA accepting L(r)

Procedure → Process in a bottom-up manner by creating an ε-NFA for each symbol in A, including
ε. Then recursively create the ε-NFA for the other operations as shown below.
1. For a in A and for ε, both are REs and are constructed as



2. For REs r and s, r.s is an RE too and it is constructed as



i.e. the start state of r becomes the start state of r.s, and the final state of s becomes the
final state of r.s.
3. For REs r and s, r+s is an RE too and it is constructed as

4. For an RE r, r* is an RE too and it is constructed as



For an RE (r), ε-NFA((r)) is ε-NFA(r).

ε-NFA(r) has at most twice as many states as the number of symbols and operators in r,
since each construction step adds at most two new states.
ε-NFA(r) has exactly one start state and one accepting state.

Each state of ε-NFA(r) has either one outgoing transition on a symbol in A or at most
two outgoing ε-transitions.


Example

(a + b)*a to NFA
Important States

A state s in an ε-NFA is an important state if it has a non-ε outgoing transition. In an
optimal state machine all states are important states.

Augmented Regular Expression

The ε-NFA created from an RE has exactly one accepting state, and that accepting state is not
an important state since it has no outgoing transition. By adding a special symbol # at the
rightmost position of the RE, we can make the accepting state an important state that has a
transition on #. The accepting state is then present in the optimal state machine. The RE (r)#
is called the augmented RE.

Procedure

1. Convert the given expression to an augmented expression by concatenating it with “#”,
i.e. (r) → (r)#.

2. Construct the syntax tree of this augmented regular expression. In this tree, all the
operators are inner nodes and all the alphabet symbols, including “#”, are leaves.

3. Number each leaf.

4. Traverse the tree to compute nullable, firstpos, lastpos and followpos.

5. Construct the DFA from the followpos values so obtained.

■ Some Definitions

o nullable(n): true if the subtree rooted at n can generate the empty string ε,
false otherwise.

o firstpos(n): the set of all positions that can be the first symbol of a string generated
by the subtree rooted at n.

o lastpos(n): the set of all positions that can be the last symbol of a string generated
by the subtree rooted at n.

o followpos(i): the set of all positions that can follow position i in a valid string of the
regular expression. To calculate followpos, we need the above three functions.


Rules to evaluate nullable, firstpos and lastpos

Node n | nullable(n) | firstpos(n) | lastpos(n)
leaf labeled ε | True | {} | {}
non-null leaf at position i | False | {i} | {i}
or-node with children a and b | nullable(a) or nullable(b) | firstpos(a) ∪ firstpos(b) | lastpos(a) ∪ lastpos(b)
cat-node with children a and b | nullable(a) and nullable(b) | if nullable(a) then firstpos(a) ∪ firstpos(b) else firstpos(a) | if nullable(b) then lastpos(a) ∪ lastpos(b) else lastpos(b)
star-node with child a | True | firstpos(a) | lastpos(a)
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Computation of followpos 
Algorithm 

for each node n in the tree do 
    if n is a cat-node with left child c1 and right child c2 then 
        for each i in lastpos(c1) do 
            followpos(i) = followpos(i) ∪ firstpos(c2) 
        end do 
    else if n is a star-node then 
        for each i in lastpos(n) do 
            followpos(i) = followpos(i) ∪ firstpos(n) 
        end do 
    end if 
end do 

Algorithm to create DFA from RE 

1. Create a syntax tree of (r)# 

2. Evaluate the functions nullable, firstpos, lastpos and followpos 

3. Start state of DFA = S0 = firstpos(root), where root is the root of the tree, as an unmarked 
state 

4. While there is an unmarked state T in the states of DFA 
    Mark T 
    for each input symbol a in A do 
        Let s1, s2, ..., sn be the positions in T whose symbol is a 
        T' <- followpos(s1) ∪ ... ∪ followpos(sn) 
        Move(T, a) <- T' 
        if (T' is not empty and not in the states of DFA) 
            Put T' into the states of DFA as an unmarked state 
        end if 
    end do 
end do 
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Note: 

The start state of resulting DFA is firstpos( root) 

The final states of the DFA are all states containing the position of #. 

Example 

Construct DFA from RE a(a + b)*bba# 



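Create a syntax tree and number the leaves from left to right: 

position   1   2   3   4   5   6   7 
symbol     a   a   b   b   b   a   # 

Calculate nullable, firstpos and lastpos. Each node is annotated with 
(nullable, firstpos, lastpos), where f = false and t = true: 

cat-node  (f, {1}, {7})                         (root) 
  cat-node  (f, {1}, {6}) 
    cat-node  (f, {1}, {5}) 
      cat-node  (f, {1}, {4}) 
        cat-node  (f, {1}, {1, 2, 3}) 
          leaf 1 (a)   (f, {1}, {1}) 
          star-node    (t, {2, 3}, {2, 3}) 
            or-node    (f, {2, 3}, {2, 3}) 
              leaf 2 (a)   (f, {2}, {2}) 
              leaf 3 (b)   (f, {3}, {3}) 
        leaf 4 (b)   (f, {4}, {4}) 
      leaf 5 (b)   (f, {5}, {5}) 
    leaf 6 (a)   (f, {6}, {6}) 
  leaf 7 (#)   (f, {7}, {7}) 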
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Calculate followpos 

Using rules we get 

followpos(1): {2, 3, 4} 
followpos(2): {2, 3, 4} 
followpos(3): {2, 3, 4} 
followpos(4): {5} 
followpos(5): {6} 
followpos(6): {7} 
followpos(7): - 

Now, 

Starting state = S1 = firstpos(root) = {1} 

Mark S1 

For a : followpos(1) = {2, 3, 4} = S2 
For b : ∅ (no position in S1 holds b) 

Mark S2 

For a : followpos(2) = {2, 3, 4} = S2 (because among {2, 3, 4}, the position of a is 2) 

For b : followpos(3) ∪ followpos(4) = {2, 3, 4, 5} = S3 (because among {2, 3, 4}, the positions 
of b are 3 and 4) 

Mark S3 

For a : followpos(2) = {2, 3, 4} = S2 

For b : followpos(3) ∪ followpos(4) ∪ followpos(5) = {2, 3, 4, 5, 6} = S4 

Mark S4 

For a : followpos(2) ∪ followpos(6) = {2, 3, 4, 7} = S5 (since it contains 7, i.e. #, it is an 
accepting state) 

For b : followpos(3) ∪ followpos(4) ∪ followpos(5) = {2, 3, 4, 5, 6} = S4 

Mark S5 

For a : followpos(2) = {2, 3, 4} = S2 

For b : followpos(3) ∪ followpos(4) = {2, 3, 4, 5} = S3 
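
The hand computation above can be cross-checked with a short program. The following is a minimal sketch in Python, not a reference implementation: the tuple encoding of the syntax tree and the names analyse and build_dfa are assumptions made for this illustration. It applies the nullable/firstpos/lastpos rules and the followpos algorithm, then builds the DFA states for a(a + b)*bba#.

```python
# Syntax tree nodes are tuples (an assumed encoding for this sketch):
#   ("leaf", symbol, position), ("or", left, right), ("cat", left, right), ("star", child)

def analyse(node, followpos):
    """Return (nullable, firstpos, lastpos) of node and fill followpos on the way."""
    kind = node[0]
    if kind == "leaf":
        pos = node[2]
        followpos.setdefault(pos, set())
        return False, {pos}, {pos}
    if kind == "or":
        n1, f1, l1 = analyse(node[1], followpos)
        n2, f2, l2 = analyse(node[2], followpos)
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":
        n1, f1, l1 = analyse(node[1], followpos)
        n2, f2, l2 = analyse(node[2], followpos)
        for i in l1:                      # cat-node rule for followpos
            followpos[i] |= f2
        first = (f1 | f2) if n1 else f1
        last = (l1 | l2) if n2 else l2
        return n1 and n2, first, last
    if kind == "star":
        _, f, l = analyse(node[1], followpos)
        for i in l:                       # star-node rule for followpos
            followpos[i] |= f
        return True, f, l
    raise ValueError(kind)

def build_dfa(root, symbol_at, alphabet, end_pos):
    followpos = {}
    _, first, _ = analyse(root, followpos)
    start = frozenset(first)
    states, unmarked, move = {start}, [start], {}
    while unmarked:
        T = unmarked.pop()                # mark T
        for a in alphabet:
            U = frozenset().union(*(followpos[p] for p in T if symbol_at[p] == a))
            if not U:
                continue                  # no transition on a
            move[T, a] = U
            if U not in states:
                states.add(U)
                unmarked.append(U)
    accepting = {s for s in states if end_pos in s}
    return start, states, move, accepting, followpos

# Leaves of a(a + b)*bba# numbered 1..7 from left to right.
leaves = {i + 1: ("leaf", s, i + 1) for i, s in enumerate("aabbba#")}
tree = leaves[1]
for right in (("star", ("or", leaves[2], leaves[3])),
              leaves[4], leaves[5], leaves[6], leaves[7]):
    tree = ("cat", tree, right)
symbol_at = {p: leaf[1] for p, leaf in leaves.items()}

start, states, move, accepting, followpos = build_dfa(tree, symbol_at, "ab", 7)
for p in sorted(followpos):
    print(p, sorted(followpos[p]))
```

States are represented as frozensets of positions so that they can be used as dictionary keys; the five sets found correspond to S1 to S5 in the hand computation.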
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State Minimization in DFA 

DFA minimization refers to the task of transforming a given deterministic finite 
automaton (DFA) into an equivalent DFA which has the minimum number of states. Two states p 
and q are called equivalent if, for every input string w, δ(p, w) is an accepting state iff δ(q, w) 
is an accepting state; otherwise they are distinguishable. We say that string w distinguishes 
state s from state t if, by starting with DFA M in state s and feeding it input w, we end up in an 
accepting state, but starting in state t and feeding it the same input w, we end up in a 
non-accepting state, or vice versa. The minimization algorithm finds the states that can be 
distinguished by some input string. Each group of states that cannot be distinguished is then 
merged into a single state. 

Procedure 

1. Partition the set of states into two groups: a) the set of accepting states and b) the set of 
non-accepting states. 

2. Split each group on the basis of distinguishable states, keeping equivalent states together in 
a group. 

3. To split a group, process the transitions from its states on every input symbol. If, on some 
input, the transitions from states of the group lead to different groups, those states are 
distinguishable, so remove them from the current group and create new groups. 

4. Repeat until every group contains only equivalent states or a single state. 
Partition 1: {{a, b, c, d, e}, {f}}: with input 0, a -> b, b -> d, c -> b and d -> e all stay in 
the same group; with input 1, e -> f (different group), so e is distinguishable from the others. 

Partition 2: {{a, b, c, d}, {e}, {f}}: with input 0, d -> e (different group). 

Partition 3: {{a, b, c}, {d}, {e}, {f}}: with input 0, b -> d (Watch!!!) 

Partition 4: {{a, c}, {b}, {d}, {e}, {f}}: with both 0 and 1, a and c go to the same groups 
(on 0, a -> b and c -> b), so no split is possible here; a and c are equivalent. 
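
The splitting procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name minimize is made up here, and because the transition diagram of the example above is not reproduced in these notes, the sketch uses a different, commonly used five-state DFA for (a + b)*abb (states A to E, accepting state E) as its input. The transition function must be total.

```python
def minimize(states, alphabet, delta, accepting):
    """Partition refinement: split groups until members agree on target groups."""
    partition = [g for g in (set(accepting), set(states) - set(accepting)) if g]
    while True:
        group_of = {s: i for i, g in enumerate(partition) for s in g}
        refined = []
        for g in partition:
            buckets = {}
            for s in sorted(g):
                # Two states stay together only if, for every input symbol,
                # their transitions lead into the same group.
                key = tuple(group_of[delta[s, a]] for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            refined.extend(buckets.values())
        if len(refined) == len(partition):   # no group was split: done
            return refined
        partition = refined

# Assumed example DFA for (a + b)*abb, not the DFA of the example above.
delta = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",
}
groups = minimize("ABCDE", "ab", delta, {"E"})
print(sorted(sorted(g) for g in groups))
```

For this input the only pair that is never split is A and C, so the minimized DFA has four states.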



Space Time Tradeoffs: NFA Vs DFA 

■ Given the RE r and the input string s, to determine whether s is in L(r) we can either 
construct an NFA from r and simulate it on s, or construct a DFA from that NFA and test s 
on the DFA. 

■ ε-NFA (for an NFA without ε-transitions only the constant factor differs) 

■ Space complexity: O(|r|) (at most twice the number of symbols and operators in r). 

■ Time complexity: O(|r|*|s|) (for each of the |s| input symbols, up to O(|r|) states and 
transitions may have to be examined). 

■ DFA 

■ Space complexity: O(2^|r|) (ε-NFA construction and then subset construction). 

■ Time complexity: O(|s|) (a single transition per input symbol, so the test is linear in |s|). 

If we can create the DFA directly from the RE, avoiding the intermediate NFA transition table, 
then we can improve the performance. 
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Exercises 

1. Given alphabet A = {0, 1}, write the regular expression for the following 

a. Strings that contain either the substring 001 or the substring 100 

b. Strings in which no two 1s occur consecutively 

c. Strings that have an odd number of 0s 

d. Strings that have an odd number of 0s and an even number of 1s 

e. Strings that have at most two 0s 

f. Strings that have at least three 1s 

g. Strings that have at most two 0s and at least three 1s 

2. Write regular definition for specifying integer number, floating number, integer array 
declaration in C. 

3. Convert the following regular expression first into NFA and DFA. 

a. 0 + (1 + 0)*00 

b. zero —> 0 
one —> 1 

bit —> zero + one 
bits —> bit* 

4. Write an algorithm for computing ε-closure(s) of any state s in an NFA. 

5. Convert the following REs to DFAs 

a. (a + b)*a 

b. (a + ε) b c* 

6. Describe the languages denoted by the following RE 

a. 0(0 + 1)*0 

b. ((ε + 0)1*)* 

c. (0 + 1)*0(0 + 1)(0 + 1) 

d. 0*10*10*10* 

e. (00 + 11)* ((01 + 10) (00 + 11)* (01 + 10) (00 + 11)*)* 

7. Construct NFA from following RE and trace the algorithm for ababbab 

a. (a + b)* 

b. (a* + b*)* 

c. ((ε + a)b*)* 

d. (a + b)*abb(a + b)* 

8. Show that following RE are equivalent by showing their minimum state DFA 

a. (a + b)* 

b. (a* + b*)* 

c. ((ε + a)b*)* 
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Chapter 4: Syntax Analysis(Parsing) 

The Role of a Parser: The second phase of the compilation process is syntax analysis, commonly 
known as parsing. A parser obtains the tokens from the lexical analyzer and checks whether the 
token string can be generated by the grammar of the source language. The parser works with the 
lexical analyzer as shown in the figure below. 



Figure: Position of parser in compiler model 

A syntax analyzer (parser) analyzes the source program based on the definition of its syntax. It 
works in lock-step with the lexical analyzer (scanner) and is responsible for creating a parse tree 
out of the source code. 

• A parser implements a Context Free Grammar. 

• Besides checking the syntax, the parser is responsible for reporting syntax errors. 

• A parser is also responsible for invoking semantic actions 

- for static semantic checking, e.g. type checking of expressions, functions etc. 

- for syntax directed translation of the source code to an intermediate representation 

- The possible intermediate representation outputs are 

■ Abstract syntax trees 

■ Control-flow graphs (CFGs) with triples, three address code or register 
transfer list notations 

Syntax Error Handling : 

A good compiler should assist in identifying and locating errors. Programs may contain errors at 
different levels such as: 

- Lexical errors: misspelling an identifier, keyword or operator. The compiler can easily 
recover and continue from those types of errors. 

- Syntax errors: e.g. an expression with unbalanced parentheses or a misplaced operator. 
Such errors are the most important and can almost always be recovered from. 

- Semantic errors (static): important and can sometimes be recovered from. 

- Semantic errors (dynamic): hard or impossible to detect at compile time; a runtime check is 
needed. 

- Logical errors: hard or impossible for the compiler to detect, e.g. infinite loops, unbounded 
recursive calls etc. 

Context Free Grammar: 

Programming languages usually have recursive structures that can be defined by context free 
grammars. A CFG is defined by a 4-tuple G = (V, T, P, S) where V is the set of variable 
(non-terminal) symbols, T is the set of terminal symbols, P is the set of productions (rules) and 
S is a special variable called the start symbol, from which the derivation of each string starts, 
e.g. 

E -> E A E | ( E ) | - E | id 
A -> + | - | * | / | ↑ 

Here, E and A are non-terminals with E as the start symbol, and the other symbols are terminals. 

Derivation: Process of obtaining the terminal strings from the start symbol of the grammar. 

• The term α => β denotes that β can be derived from α by applying a single production. 

• The term α =>* β denotes that β can be derived from α by applying 0 or more productions. 

• The term α =>+ β denotes that β can be derived from α by applying 1 or more productions. 

• If β is derived from α by replacing the left-most (right-most) non-terminal in every derivation 
step, then it is called a left-most (right-most) derivation of β from α. 

Leftmost : Replace the leftmost non-terminal symbol 
E => E A E => id A E => id * E => id * id 
Rightmost : Replace the rightmost non-terminal symbol 
E => E A E => E A id => E * id => id * id 

EXAMPLE: id * id is a sentence of the above grammar, since it has the derivation 

E => E A E => E * E => id * E => id * id 
i.e. E =>* id * id 

Parse Tree: A graphical representation of the derivation of a string from the grammar. 

Example: 

[Figure: parse tree for id + id * id] 

Derivation: 

E => E * E 
=> E + E * E 
=> id + E * E 
=> id + id * E 
=> id + id * id 
Ambiguity: If the same terminal string can be derived from the grammar using two or more distinct 
left-most (or right-most) derivations, then the grammar is said to be ambiguous, i.e. from an 
ambiguous grammar we can get two or more distinct parse trees for the same terminal string. 

Left-recursion: A left recursive grammar is one that has rules of the form A -> Aα, for some α. 
Top-down parsing can't handle this type of grammar, since the parser can keep choosing the 
left-recursive production without consuming any input, and so it moves into an infinite loop such as 
A => Aα => Aαα => Aααα ... etc. for the grammar A -> Aα | β 

The left recursion can be removed from the grammar as follows: 

Left recursive grammar: 

A -> Aα | β 

is rewritten to the following: 

A -> βA' 

A' -> αA' | ε 

In general, 

the left recursive rules 

A -> Aα1 | Aα2 | ... | Aαm | β1 | β2 | ... | βn, where no βi begins with A, can be rewritten 
without left recursion as 

A -> β1A' | β2A' | ... | βnA' 

A' -> α1A' | α2A' | ... | αmA' | ε 

Example: 

E -> E + T | T 
T -> T * F | F 
F -> ( E ) | id 

Here the E and T productions contain left recursion, so after removing it the grammar 
without left recursion is 

E -> TE' 

E' -> + TE' | ε 
T -> FT' 

T' -> * FT' | ε 
F -> ( E ) | id 
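
The general rule for removing immediate left recursion is mechanical enough to sketch in code. The following Python fragment is an illustration only; the function name and the representation of a right side as a list of symbols (with the empty list standing for ε) are assumptions of this sketch.

```python
def remove_immediate_left_recursion(nt, productions):
    """productions: list of right sides, each a list of symbols; [] denotes ε."""
    # A -> A α1 | ... | A αm   (left-recursive alternatives, without the leading A)
    alphas = [rhs[1:] for rhs in productions if rhs and rhs[0] == nt]
    # A -> β1 | ... | βn       (alternatives that do not begin with A)
    betas = [rhs for rhs in productions if not rhs or rhs[0] != nt]
    if not alphas:
        return {nt: productions}          # nothing to do
    new_nt = nt + "'"
    return {
        nt: [beta + [new_nt] for beta in betas],                 # A  -> βi A'
        new_nt: [alpha + [new_nt] for alpha in alphas] + [[]],   # A' -> αi A' | ε
    }

print(remove_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))
print(remove_immediate_left_recursion("T", [["T", "*", "F"], ["F"]]))
```

The sketch handles only immediate left recursion (rules that begin with their own left side), which is the case discussed above.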
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Parsing 

Given a stream of tokens, parsing involves the process of reducing them to a non-terminal. The 
input string is said to represent the non-terminal it was reduced to. 

Parsing can be either top-down or bottom-up. 

• Top-down parsing involves generating the string starting from the first non-terminal (start 
symbol) and repeatedly applying the productions of the grammar. 

• Bottom-up parsing involves repeatedly rewriting the input string until it ends up as the start 
non-terminal. 


Following are the Top-down parsing algorithms 

1. Recursive Descent Parsing 

2. Non-recursive predictive parsing 


Consider the grammar 
S -> cAd 
A -> ab | a 

and the input string w = cad. The recursive descent parsing algorithm can be described as below. 

Let the parsing (output) string be α = S (the start symbol). 

1. Let iptr be the index into the input string w and optr be the index into the output string α; 
initially 

iptr = optr = 0 

Input: (iptr)cad; Output: (optr)S 

2. While α[optr] is a non-terminal, expand the non-terminal with its first production rule. 

Input: (iptr)cad; Output: (optr)cAd 

3. While w[iptr] = α[optr], increment both iptr and optr. If the end of the string is reached, 
then success. 

Input: c(iptr)ad; Output: c(optr)Ad 

4. The while loop above stops if 

a. a non-terminal is encountered in α, 

b. the end of the string is reached, or 

c. w[iptr] != α[optr] 

5. If (a) is true, then go to step 2 and expand the non-terminal with its first production. 

Input: c(iptr)ad; Output: c(optr)abd 

Input: ca(iptr)d; Output: ca(optr)bd 

6. If (b) is true then exit with success. If (c) is true then revert iptr and optr to where they 
were at the last expansion, and replace the non-terminal with its next production rule. 

Input: c(iptr)ad; Output: c(optr)Ad 

Input: c(iptr)ad; Output: c(optr)ad 

Input: ca(iptr)d; Output: ca(optr)d 

Input: cad(iptr); Output: cad(optr) 
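
The backtracking behaviour traced above can be sketched compactly in Python. This is an illustrative sketch rather than the index-based procedure itself: instead of explicit iptr/optr bookkeeping it uses a generator (the name derive is an assumption of this sketch) that tries the productions of a non-terminal in order and falls back to the next one when a later match fails.

```python
GRAMMAR = {"S": [["c", "A", "d"]], "A": [["a", "b"], ["a"]]}

def derive(symbols, w, pos):
    """Yield every input index reachable after matching `symbols` against w[pos:]."""
    if not symbols:
        yield pos
        return
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:                       # non-terminal: try productions in order
        for rhs in GRAMMAR[head]:
            for mid in derive(rhs, w, pos):
                yield from derive(rest, w, mid)
    elif pos < len(w) and w[pos] == head:     # terminal: must match the input symbol
        yield from derive(rest, w, pos + 1)
    # otherwise: mismatch, nothing is yielded and the caller backtracks

def accepts(w):
    return any(end == len(w) for end in derive(["S"], w, 0))

print(accepts("cad"), accepts("cabd"), accepts("cd"))
```

For w = cad the first alternative A -> ab fails at the symbol b, after which A -> a is tried, exactly as in the trace. Like any recursive descent parser, this sketch would not terminate on a left-recursive grammar.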
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Exercise: 

Given the following grammar 

R -> idS | (R)S 
S -> +RS | .RS | *S | ε 

The token id can be one of {a, b}. Determine whether the following strings are in the language of 
the grammar using recursive descent parsing. 

1. a.(a+b)*.b 

2. (b.a.b)* 

3. a.b..b.a 

Non-recursive Predictive parsing 

A recursive descent parser always chooses the first available production whenever it encounters a 
non-terminal. This is inefficient and causes a lot of backtracking. It also suffers from the 
left-recursion problem. A predictive parser avoids backtracking and infinite looping by using the 
next input symbol to choose the proper production for the derivation of the input string. 

Lookahead predictive parser: 

- Predictive parsing relies on information about which first symbols can be generated from a 
production. 

- If the first symbol of a production is a non-terminal, then the non-terminal has to be 
expanded until we get a set of terminals. 

- For any given string α of terminals and non-terminals, FIRST(α) is the set of all 
terminals that can begin the strings derived from α. 

Consider the rules: 

type -> simple | id | array [simple] of type 
simple -> integer | char 

then the FIRST set of each symbol is: 

FIRST(type) = {integer, char, id, array} 

FIRST(simple) = {integer, char} 

For each terminal symbol 'a', FIRST(a) = {a} 
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Non-Recursive predictive parsing 


[Figure: model of a non-recursive predictive parser. The stack holds non-terminal and terminal 
symbols of the CFG above the empty-stack symbol.] 

A non-recursive predictive parser is a table-driven parser that comprises an input 
buffer, a stack, a parsing table and an output stream. 

The input buffer and the stack are delimited by a special '$' symbol that denotes the end of 
the stack or buffer. 

The parsing table is made of entries of the form M[X,a] = α, which says that if the top of the 
stack is the non-terminal X and the input buffer points to symbol 'a', then X has to be replaced 
by α. Here α may be a string of terminals and non-terminals, or error. 

The program for the parser behaves as follows: 

1. If both the stack top symbol and the input symbol are '$', the parser halts and announces 
successful parsing. 

2. If the stack top symbol and the input symbol match and are not '$', the parser 
pops the stack and advances the input pointer to the next symbol. 

3. If the stack top symbol X is a non-terminal, then the program consults entry M[X,a] of the 
parsing table M. This entry will be either an X-production or an error. 

■ If it is an X-production {X -> UVW}, the parser replaces X on top of the stack by 
WVU (U becomes the new top symbol). 

■ If M[X,a] = Error, the parser calls an error recovery routine. 


So the algorithm for the parser can be explained as below. 

Input: A string w and a parsing table M for the grammar G. 

Method: Initially the parser is in a configuration with $S on the stack (S on top) and w$ 
in the input buffer. 

1. Set ip to the first symbol of the input string. 

2. Set the stack to $S where S is the start symbol of the grammar G. 

3. Repeat 

Let X be the top stack symbol and 'a' be the symbol pointed to by ip. 

a. If X is a terminal or '$' then 

i. If X = a then pop X from the stack and advance ip 

ii. else error() 

b. Else 

i. If M[X,a] = Y1Y2...Yk then 

1. Pop X from the stack 

2. Push Yk, Yk-1, ..., Y2, Y1 on the stack with Y1 on top. 

3. Output the production X -> Y1Y2...Yk 

ii. else error() 

Until X = '$' 

Example: 

E -> TE' 

E' -> + TE' | ε 
T -> FT' 

T' -> * FT' | ε 
F -> ( E ) | id 

Given the parsing table for the grammar G as: 

Non-terminal   id          +             *             (           )          $ 
E              E -> TE'                                E -> TE' 
E'                         E' -> +TE'                              E' -> ε    E' -> ε 
T              T -> FT'                                T -> FT' 
T'                         T' -> ε       T' -> *FT'                T' -> ε    T' -> ε 
F              F -> id                                 F -> (E) 



For the input string id + id * id, the moves made by the predictive parser will be as follows 

STACK        INPUT              OUTPUT 
$E           id + id * id$ 
$E'T         id + id * id$      E -> TE' 
$E'T'F       id + id * id$      T -> FT' 
$E'T'id      id + id * id$      F -> id 
$E'T'        + id * id$ 
$E'          + id * id$         T' -> ε 
$E'T+        + id * id$         E' -> +TE' 
$E'T         id * id$ 
$E'T'F       id * id$           T -> FT' 
$E'T'id      id * id$           F -> id 
$E'T'        * id$ 
$E'T'F*      * id$              T' -> *FT' 
$E'T'F       id$ 
$E'T'id      id$                F -> id 
$E'T'        $ 
$E'          $                  T' -> ε 
$            $                  E' -> ε 
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The leftmost derivation for the example is as follows: 

E =>TE’ =>FT’E’ =>id T’E’ =>id E’ =>id + TE’ =>id + FT’E’ 

=> id + id T’E’ => id + id * FT’E’ => id + id * id T’E’ 

=> id + id * id E’ => id + id * id 
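
The table-driven algorithm and the trace above can be reproduced with a short Python sketch. The dictionary TABLE encodes the parsing table of the example grammar; the names ll1_parse, TABLE and EPS, and the use of an empty tuple for ε, are assumptions of this sketch rather than part of the algorithm.

```python
EPS = ()   # empty right side, standing for ε

TABLE = {
    ("E", "id"): ("T", "E'"), ("E", "("): ("T", "E'"),
    ("E'", "+"): ("+", "T", "E'"), ("E'", ")"): EPS, ("E'", "$"): EPS,
    ("T", "id"): ("F", "T'"), ("T", "("): ("F", "T'"),
    ("T'", "+"): EPS, ("T'", "*"): ("*", "F", "T'"),
    ("T'", ")"): EPS, ("T'", "$"): EPS,
    ("F", "id"): ("id",), ("F", "("): ("(", "E", ")"),
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def ll1_parse(tokens, start="E"):
    """Return the list of productions used (a leftmost derivation) or raise SyntaxError."""
    stack, buf, ip, output = ["$", start], list(tokens) + ["$"], 0, []
    while stack[-1] != "$":
        X, a = stack[-1], buf[ip]
        if X not in NONTERMINALS:             # terminal on top: must match the input
            if X != a:
                raise SyntaxError(f"expected {X!r}, found {a!r}")
            stack.pop()
            ip += 1
        elif (X, a) in TABLE:                 # expand X using M[X, a]
            rhs = TABLE[X, a]
            stack.pop()
            stack.extend(reversed(rhs))       # push Yk ... Y1, so Y1 ends up on top
            output.append((X, rhs))
        else:
            raise SyntaxError(f"no table entry for {X!r} on {a!r}")
    if buf[ip] != "$":
        raise SyntaxError("unconsumed input")
    return output

for head, rhs in ll1_parse(["id", "+", "id", "*", "id"]):
    print(head, "->", " ".join(rhs) or "ε")
```

The printed productions appear in the same order as the OUTPUT column of the trace.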

Parsing Table: 

The parsing table M comprises entries of the form M[X,a] = UVW, meaning that if the top of the 
stack holds X and the input symbol 'a' is read, then X should be replaced by UVW. 

• The construction of the parsing table is aided by two functions: FIRST and FOLLOW. 

• If α is any string of grammar symbols, then FIRST(α) is the set of all terminals that begin 
the strings that can be derived from α. If α =>* ε, then ε is also in FIRST(α). 

• For any non-terminal A, FOLLOW(A) is the set of terminals 'a' that can appear immediately 
to the right of A in some sentential form. That is, there exists a derivation of the form 
S =>* αAaβ, for some α and β. 

• If A can be the right-most symbol in some sentential form, then $ is also in FOLLOW(A). 


Computation of FIRST 

1. For all terminals 'a', FIRST(a) = {a} 

2. For any non-terminal X, if X -> ε is a production rule, add ε to FIRST(X), i.e. 
FIRST(X) = FIRST(X) ∪ {ε} 

3. If X is a non-terminal, for every rule of the form X -> Y1Y2...Yk, update FIRST(X) by 
the following rules: 

a. FIRST(X) = FIRST(X) ∪ (FIRST(Y1) - {ε}). 

b. For all 1 < i <= k, FIRST(X) = FIRST(X) ∪ (FIRST(Yi) - {ε}) if ε is in FIRST(Yj) for all 
1 <= j < i. If ε is in FIRST(Y1) ∩ FIRST(Y2) ∩ ... ∩ FIRST(Yk), then FIRST(X) 
= FIRST(X) ∪ {ε} 

Computation of FOLLOW 

Apply the following rules until nothing can be added to any FOLLOW set. 

1. Place $ in FOLLOW(S), where S is the start symbol and $ is the right end marker. 

2. If there is a production of the form A -> αBβ, then everything in FIRST(β) except ε is 
placed in FOLLOW(B). 

3. If FIRST(β) in the above case contains ε, or if there is a rule of the form A -> αB, then 
everything in FOLLOW(A) is in FOLLOW(B). 
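
Both sets are computed by applying the rules repeatedly until nothing changes. The following Python sketch illustrates this fixed-point iteration for the expression grammar E -> TE', E' -> +TE' | ε, T -> FT', T' -> *FT' | ε, F -> (E) | id. The function names and the dictionary representation of the grammar are assumptions made for this illustration.

```python
EPS = "ε"
GRAMMAR = {                      # [] is the empty right side (ε)
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["id"]],
}

def first_of(symbols, first):
    """FIRST of a string of grammar symbols, given the FIRST sets found so far."""
    result = set()
    for X in symbols:
        fx = first[X] if X in first else {X}   # FIRST(a) = {a} for a terminal a
        result |= fx - {EPS}
        if EPS not in fx:
            return result
    result.add(EPS)                            # every symbol can derive ε
    return result

def compute_first(grammar):
    first = {A: set() for A in grammar}
    changed = True
    while changed:
        changed = False
        for A, rules in grammar.items():
            for rhs in rules:
                new = first_of(rhs, first)
                if not new <= first[A]:
                    first[A] |= new
                    changed = True
    return first

def compute_follow(grammar, first, start):
    follow = {A: set() for A in grammar}
    follow[start].add("$")                     # rule 1
    changed = True
    while changed:
        changed = False
        for A, rules in grammar.items():
            for rhs in rules:
                for i, B in enumerate(rhs):
                    if B not in grammar:
                        continue               # FOLLOW is defined for non-terminals only
                    tail = first_of(rhs[i + 1:], first)
                    new = tail - {EPS}         # rule 2
                    if EPS in tail:
                        new |= follow[A]       # rule 3
                    if not new <= follow[B]:
                        follow[B] |= new
                        changed = True
    return follow

first = compute_first(GRAMMAR)
follow = compute_follow(GRAMMAR, first, "E")
for A in GRAMMAR:
    print(A, sorted(first[A]), sorted(follow[A]))
```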

Example: 

Consider the following grammar 
R -> idS | (R)S 
S -> +RS | .RS | *S | ε 
Here, 

FIRST(R) = {id, (} 

FIRST(S) = {+, ., *, ε} 

FOLLOW(R) = {), +, ., *, $} 

FOLLOW(S) = {), +, ., *, $} from R -> idS using the third rule above 

Another example: 

E -> TE' 

E' -> + TE' | ε 
T -> FT' 

T' -> * FT' | ε 
F -> ( E ) | id 

FIRST(E) = FIRST(T) = FIRST(F) = {(, id} 

FIRST(E') = {+, ε} 

FIRST(T') = {*, ε} 

FOLLOW(E) = {), $} since E is the start symbol, $ is in FOLLOW(E), and ) comes into 
FOLLOW(E) from the production F -> ( E ) 

FOLLOW(E') = FOLLOW(E) = {), $} from E -> TE' 

FOLLOW(T) = {+, ), $} where ) and $ come from FOLLOW(E) via E -> TE', since ε is in FIRST(E'), 
and + comes from FIRST(E') 

FOLLOW(T') = FOLLOW(T) = {+, ), $} since T -> FT' 

FOLLOW(F) = {*, +, ), $} where * comes from FIRST(T') in T -> FT' and the others come from 
FOLLOW(T) in T -> FT', since ε is in FIRST(T') 


Example 

S -> iEtSS' | a 
S' -> eS | ε 
E -> b 

The FIRST and FOLLOW: 
FIRST: 

First(S) = {i, a} 

First(S') = {e, ε} 

First(E) = {b} 

FOLLOW: 

Follow(S) - contains $, since S is the start symbol 

• Since S -> iEtSS', put everything in First(S') except ε into Follow(S) 

• Since S' =>* ε in S -> iEtSS', put Follow(S) into Follow(S) (nothing new) 

• Since S' -> eS, put Follow(S') into Follow(S). So 
Follow(S) = {e, $} 

• Follow(S') = Follow(S) 

• Follow(E) = {t} 


Construction of PARSING TABLE : Algorithm 
For each production A -> α of the grammar do: 

1. For each terminal 'a' in FIRST(α), add A -> α to M[A,a]. 

2. (a) If ε is in FIRST(α), add A -> α to M[A,b] for every terminal 'b' in 
FOLLOW(A). 

(b) If ε is in FIRST(α) and $ is in FOLLOW(A), then add A -> α to M[A,$]. 

3. Make every undefined entry of M ERROR. 
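
As an illustration of the algorithm, the Python sketch below fills the table for the grammar S -> iEtSS' | a, S' -> eS | ε, E -> b, taking the FIRST and FOLLOW sets as given. The names build_table and first_of are assumptions of this sketch. Each table cell is kept as a list, so a multiply-defined entry shows up as a list with more than one production; missing keys play the role of ERROR entries.

```python
EPS = "ε"
GRAMMAR = {                      # [] is the empty right side (ε)
    "S": [["i", "E", "t", "S", "S'"], ["a"]],
    "S'": [["e", "S"], []],
    "E": [["b"]],
}
FIRST = {"S": {"i", "a"}, "S'": {"e", EPS}, "E": {"b"}}
FOLLOW = {"S": {"e", "$"}, "S'": {"e", "$"}, "E": {"t"}}

def first_of(symbols):
    """FIRST of a right side, using the precomputed FIRST sets."""
    result = set()
    for X in symbols:
        fx = FIRST.get(X, {X})   # FIRST(a) = {a} for a terminal a
        result |= fx - {EPS}
        if EPS not in fx:
            return result
    return result | {EPS}

def build_table():
    table = {}
    for A, rules in GRAMMAR.items():
        for rhs in rules:
            f = first_of(rhs)
            targets = f - {EPS}              # rule 1
            if EPS in f:
                targets |= FOLLOW[A]         # rules 2(a) and 2(b); "$" is kept in FOLLOW
            for a in targets:
                table.setdefault((A, a), []).append(rhs)
    return table

table = build_table()
conflicts = {cell: prods for cell, prods in table.items() if len(prods) > 1}
print(conflicts)
```

For this grammar the only multiply-defined entry is M[S', e], which holds both S' -> eS and S' -> ε.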

Example of parsing table for the grammar: 

R -> idS | (R)S 
S -> +RS | .RS | *S | ε 

The FIRST and FOLLOW as computed above are: 

FIRST(R) = {id, (} 

FIRST(S) = {+, ., *, ε} 

FOLLOW(R) = {), +, ., *, $} 

FOLLOW(S) = {), +, ., *, $} 


The parsing table is as below: 

Non-terminal   id          (            )         +            .            *           $ 
R              R -> idS    R -> (R)S 
S                                       S -> ε    S -> +RS     S -> .RS     S -> *S     S -> ε 
                                                  S -> ε       S -> ε       S -> ε 



In the above example, 

• FIRST(R) contains id and (, so in the columns for these inputs the corresponding productions of 
R have been entered. 

• FIRST(S) contains {+, ., *, ε}, so for the symbols +, . and * the production whose right side 
begins with that symbol is entered in the table. 

• FIRST(S) contains ε, so for every symbol in FOLLOW(S) the production S -> ε is entered. 

In the above parsing table, some entries M[X,a] are multiply defined. If the grammar is 
ambiguous then the parsing table will contain multiple entries for some M[X,a]. So by constructing 
the parsing table for a grammar we can detect that it is not suitable for predictive parsing, 
which is always the case for an ambiguous grammar. 
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Constructing Parsing Table another Example: Grammar, FIRST and FOLLOW 


S -> iEtSS' | a      First(S) = {i, a}      Follow(S) = {e, $} 
S' -> eS | ε         First(S') = {e, ε}     Follow(S') = {e, $} 
E -> b               First(E) = {b}         Follow(E) = {t} 

Construction of Parsing Table: 

S -> iEtSS' : First(iEtSS') = {i} 
S -> a      : First(a) = {a} 
E -> b      : First(b) = {b} 
S' -> eS    : First(eS) = {e} 
S' -> ε     : First(ε) = {ε}, Follow(S') = {e, $} 

Non-terminal   a         b         e            i              t       $ 
S              S -> a                           S -> iEtSS' 
S'                                 S' -> eS                            S' -> ε 
                                   S' -> ε 
E                        E -> b 




Constructing Parsing Table - Another Example 3 


Grammar: 

E -> TE' 
E' -> + TE' | ε 
T -> FT' 
T' -> * FT' | ε 
F -> ( E ) | id 

FIRST 
First(E) = First(T) = First(F) = {(, id} 
First(E') = {+, ε} 
First(T') = {*, ε} 

FOLLOW 
Follow(E) = {), $} 
Follow(E') = {), $} 
Follow(F) = {*, +, ), $} 
Follow(T) = {+, ), $} 
Follow(T') = {+, ), $} 

Expression Example: 

E -> TE' : First(TE') = First(T) = {(, id} 
so M[E, (] : E -> TE' 
   M[E, id] : E -> TE'           by rule 1 
E' -> +TE' : First(+TE') = {+} 
so M[E', +] : E' -> +TE'         by rule 1 
T' -> *FT' : First(*FT') = {*} 
so M[T', *] : T' -> *FT'         by rule 1 
T -> FT' : First(FT') = First(F) = {(, id} 
so M[T, (] : T -> FT' 
   M[T, id] : T -> FT'           by rule 1 
F -> (E) | id : First((E)) = {(} and First(id) = {id}, so by rule 1 
   M[F, (] : F -> (E) 
   M[F, id] : F -> id 

By rule 2: 
E' -> ε : ε is in First(ε) and Follow(E') = {), $} 
   M[E', )] : E' -> ε            (2.a) 
   M[E', $] : E' -> ε            (2.b) 
T' -> ε : ε is in First(ε) and Follow(T') = {+, ), $} 
   M[T', +] : T' -> ε            (2.a) 
   M[T', )] : T' -> ε            (2.a) 
   M[T', $] : T' -> ε            (2.b) 



Non-terminal   id          +             *             (           )          $ 
E              E -> TE'                                E -> TE' 
E'                         E' -> +TE'                              E' -> ε    E' -> ε 
T              T -> FT'                                T -> FT' 
T'                         T' -> ε       T' -> *FT'                T' -> ε    T' -> ε 
F              F -> id                                 F -> (E) 



Exercise: Construct the Parsing table for the grammar below using FIRST and FOLLOW. 
S -> cAd 
A -> ab | a 


LL(1) Grammar: 

A grammar whose parsing table has no multiply-defined entries is said to be an LL(1) grammar. The 
first L in LL(1) corresponds to reading the input from left to right and the second L corresponds 
to the left-most derivation. The 1 in the parentheses corresponds to a maximum lookahead of 1 symbol. 

• No ambiguous or left-recursive grammar is LL(1). 

• There are no general rules to convert a non-LL(1) grammar into an LL(1) grammar. 

• The properties of an LL(1) grammar are as below: 

In any LL(1) grammar, if there exists a rule of the form A -> α | β where α and β are distinct, 
then 

1. For any terminal 'a', if a is in FIRST(α) then a is not in FIRST(β). 

2. At most one of α and β can derive ε, i.e. α =>* ε or β =>* ε, but not both. 

3. If β =>* ε, then α does not derive any string beginning with a terminal in FOLLOW(A). 


Exercise: 

1. Show that the following grammar is ambiguous by constructing parsing table 
S -> aSbS | bSaS | ε 

Bottom Up Parsing: 

• Bottom-up parsing attempts to construct a parse tree for an input string beginning at the 
leaves (the bottom) and working up towards the root (the top). 

• The process of replacing a substring by a non-terminal in bottom-up parsing is called 
reduction. 

• The left-most reduction method corresponds to the right-most derivation of top-down 
parsing traced in reverse, and the right-most reduction corresponds to the left-most derivation 
of top-down parsing. 

Consider an example grammar 

S -> aABe 
A -> Abc | b 
B -> d 

Now the string abbcde can be reduced to S by the following steps 
abbcde 

aAbcde (replace b by A using A -> b) 
aAde (replace Abc by A using A -> Abc) 
aABe (replace d by B using B -> d) 

S (replace aABe by S using S -> aABe) 

A substring that matches the right side of a production, and whose replacement by the non-terminal 
on the left side is one step of the reduction to the start symbol, is called a handle. So the 
principal task of bottom-up parsing is to identify the handle that can be replaced by a 
non-terminal. Identifying the handle amounts to implementing a DFA that recognizes the handle 
string. 

Shift-Reduce Parsing 

A simple bottom-up parsing technique using a stack-based implementation is shift-reduce parsing. 
A convenient way to implement a shift-reduce parser uses a stack to hold the grammar symbols and 
an input buffer to hold the input string w. 

The method is described as: 

1. Initially the stack contains only the sentinel $, and the input buffer contains the input 
string w$. 

2. While the stack is not equal to $S do 

a. While there is no handle on top of the stack, shift the next input symbol from the buffer 
and push it onto the stack. 

b. If there is a handle on top of the stack, then pop the handle and push the non-terminal 
it reduces to onto the stack. 

Example: 

For the above example string abbcde$, the shift-reduce actions may be 

Stack     Input       Action 
$         abbcde$ 
$a        bbcde$      shift a 
$ab       bcde$       shift b 
$aA       bcde$       reduce by A -> b 
$aAb      cde$        shift b 
$aAbc     de$         shift c 
$aA       de$         reduce by A -> Abc 
$aAd      e$          shift d 
$aAB      e$          reduce by B -> d 
$aABe     $           shift e 
$S        $           reduce by S -> aABe 



Conflict during Shift-Reduce Parsing 

Not all grammars can be parsed by shift-reduce parsing, since there may be conflicts during the 
parsing actions. For some grammars the parser cannot decide whether to shift or to reduce, or 
cannot decide which of several reductions to make, so the parsing action results in a conflict. 

There are two kinds of conflicts in this method. 

1. Shift-reduce conflict: The parser is not able to decide whether to shift or to reduce, e.g. if 
A -> ab | abcd are the productions of A, the stack contains $ab and the input buffer 
contains cd$, the parser cannot decide whether to reduce ab to the non-terminal A or to shift 
two more symbols before reducing. 

2. Reduce-reduce conflict: In this case the parser cannot decide which production to 
use for the reduction, e.g. if A -> bc and B -> abc are productions of the grammar and the stack 
contains $abc, the parser cannot decide whether to reduce it to $aA or to $B. 

Grammars that lead the shift-reduce parser into conflicts are known as non-LR grammars. 
Shift-reduce parsers can be built successfully for LR grammars and operator grammars. 

An operator grammar has the property that no production right side is ε or has two adjacent 
non-terminals. 

e.g. E -> EAE | (E) | -E | id is not an operator grammar. 

E -> E+E | E-E | E*E | (E) | -E | id is an operator grammar. 

LR Grammars 

• The class of grammars called LR(k) grammars has the most efficient bottom-up parsers, and such 
parsers can be implemented for almost any programming language. 

• The L stands for a left-to-right scan of the input buffer, the R for a right-most 
derivation in reverse (left-most reduction) and the k for the maximum number of input symbols 
of lookahead used for making parsing decisions. 

• If k is omitted, k is assumed to be 1. 

• The class of grammars that can be parsed using LR methods is a proper superset of the class 
of grammars that can be parsed with a predictive parser. 

• It is difficult to write or trace an LR parser by hand. Usually a generator like Yacc (bison) is 
used to build an LR parser. 

• The LR parsing method is the most general non-backtracking shift-reduce parsing method 
known. 

• LL(1) grammars ⊂ LR(1) grammars. 

LR-Parsers 

o cover a wide range of grammars 
o SLR - simple LR parser 
o LR - most general LR parser (canonical LR) 

o LALR - intermediate LR parser (lookahead LR parser) 

o SLR, LR and LALR parsers work in the same way (they use the same algorithm); only their 
parsing tables are different. 

Structure of LR Parser: 



An LR parser comprises a stack, an input buffer and a parsing table that has two parts: action 
and goto. Its stack holds entries of the form s0 X1 s1 X2 s2 ... Xm sm, where every si is called a 
state and every Xi is a grammar symbol (terminal or non-terminal). 

If the top of the stack is sm and the input symbol is ai, the LR parser consults action[sm, ai], 
which can be one of four actions. 

1. Shift s, where s is a state. 

2. Reduce using a production A -> β 

3. Accept 

4. Error 

The function goto takes a state and a grammar symbol as arguments and produces a state. The goto 
function forms the transition table of a DFA that recognizes the viable prefixes of the grammar. 

Note: Viable prefixes are the prefixes of right sentential forms that can appear on the stack of 
the parser, i.e. a viable prefix is a prefix of a right sentential form that does not continue past 
the end of the right-most handle of that sentential form. 

The configuration of the LR parser is a tuple comprising the stack contents and the contents of 
the unconsumed input buffer. It is of the form 
(s0 X1 s1 X2 s2 ... Xm sm, ai ai+1 ... an $) 

1. If action[sm, ai] = shift s, the parser executes a shift move, entering the configuration 
(s0 X1 s1 X2 s2 ... Xm sm ai s, ai+1 ... an $), shifting both ai and the state s onto the stack. 

2. If action[sm, ai] = reduce A -> β, the parser executes a reduce move, entering the 
configuration (s0 X1 s1 X2 s2 ... Xm-r sm-r A s, ai ai+1 ... an $). Here s = goto[sm-r, A] and r is 
the length of the handle β. 

Note: if r is the length of β, then the parser pops 2r symbols from the stack and pushes A and 
the state given by goto[sm-r, A]. After the reduction, the parser outputs the production A -> β 
used in the reduce operation. 

3. If action[sm, ai] = accept, then the parser accepts, i.e. parsing is successfully completed 
and the parser stops. 

4. If action[sm, ai] = error, then the parser calls the error recovery routine. 


Construction of the Parsing Table for the SLR parser (Simple LR) 

• SLR parsers are the simplest class of LR parsers. Constructing the action and goto parts of the
parsing table involves building a state machine that can identify handles.

• To build the state machine, we need to define three terms: item, closure and goto.

Item: An "item" (LR(0) item) is a production with a dot (.) at some position in its right side.
For example, A -> .αBβ, A -> α.Bβ, A -> αB.β and A -> αBβ. are items if there is a
production A -> αBβ in the grammar.

An item encapsulates what we have read so far and what we expect to read next from the
input buffer.


The Closure operation:

If I is a set of items for a grammar G, then closure(I) is the set of items constructed from I using
the following rules:

1. Initially, every item in I is added to closure(I).

2. If A -> α.Bβ is in closure(I) and B -> γ is a production, then add the item B -> .γ to
closure(I) if it is not already there.

Apply this rule until no more new items can be added to closure(I).

Example: 

E' -> E
E -> E+T | T
T -> T*F | F
F -> (E) | id

If I = {E' -> .E} then closure(I) contains the following items:

{
E' -> .E,
E -> .E+T,
E -> .T,
T -> .T*F,
T -> .F,
F -> .(E),
F -> .id
}


Note:

• E' -> .E and all items whose dot is not at the left end (the beginning of the RHS) are called
kernel items.

• All items with the dot at the left end are non-kernel items, except E' -> .E, which is always a
kernel item.

The goto operation: 

For a set of items I and a grammar symbol X, consider all items of the form A -> α.Xβ that are in I.

goto[I,X] is defined as the closure of the set of all items of the form A -> αX.β.
In the example above, if I0 = closure({E' -> .E}) then

goto[I0,E] = I1 = closure({E' -> E., E -> E.+T}), since closure({E' -> .E}) is
{E' -> .E, E -> .E+T, E -> .T, T -> .T*F, T -> .F, F -> .(E), F -> .id}

Similarly, goto[I0,T] = closure({E -> T., T -> T.*F}),
goto[I0,F] = closure({T -> F.}), and so on.
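The closure and goto operations can be sketched in a few lines of Python. This is only an illustration under assumptions of our own: an item is represented as a tuple (head, body, dot), and the names GRAMMAR, closure, goto and show are chosen here, not taken from the notes.

```python
# Augmented expression grammar from the example above.
GRAMMAR = {
    "E'": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("id",)],
}


def closure(items):
    """LR(0) closure; an item is (head, body, dot) with body a tuple of symbols."""
    result = set(items)
    work = list(items)
    while work:
        head, body, dot = work.pop()
        # Dot in front of a non-terminal B: add every item B -> .gamma
        if dot < len(body) and body[dot] in GRAMMAR:
            for prod in GRAMMAR[body[dot]]:
                item = (body[dot], prod, 0)
                if item not in result:
                    result.add(item)
                    work.append(item)
    return frozenset(result)


def goto(items, symbol):
    """Move the dot over `symbol` wherever possible, then take the closure."""
    moved = {(h, b, d + 1) for h, b, d in items if d < len(b) and b[d] == symbol}
    return closure(moved)


def show(item):
    head, body, dot = item
    return f"{head} -> {' '.join(body[:dot])}.{' '.join(body[dot:])}"


I0 = closure({("E'", ("E",), 0)})
I1 = goto(I0, "E")
print(sorted(show(i) for i in I0))
print(sorted(show(i) for i in I1))
```

Using frozenset for item sets makes them hashable and comparable, which is convenient when the sets later become the states of the DFA.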

The complete goto operation defines the DFA that identifies handles.

Computing the Canonical LR(0) Collection

To create the SLR parsing tables for a grammar G, we first create the canonical LR(0) collection of
the augmented grammar G'.

Algorithm:

C = { closure({S' -> .S}) }

repeat the following until no more sets of LR(0) items can be added to C:
    for each I in C and each grammar symbol X
        if goto(I,X) is not empty and not in C
            add goto(I,X) to C


The goto function is a DFA on the sets in C.
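The algorithm above can be sketched in Python as follows. The item representation (head, body, dot), the helper names and the order in which grammar symbols are tried are assumptions made for this sketch; the symbol order is chosen so that the states come out numbered like I0 to I11 in the worked example.

```python
GRAMMAR = {
    "E'": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("id",)],
}
SYMBOLS = ["E", "T", "F", "(", "id", "+", "*", ")"]


def closure(items):
    result, work = set(items), list(items)
    while work:
        _, body, dot = work.pop()
        if dot < len(body) and body[dot] in GRAMMAR:
            for prod in GRAMMAR[body[dot]]:
                item = (body[dot], prod, 0)
                if item not in result:
                    result.add(item)
                    work.append(item)
    return frozenset(result)


def goto(items, symbol):
    return closure({(h, b, d + 1) for h, b, d in items
                    if d < len(b) and b[d] == symbol})


def canonical_collection():
    """Return (states, transitions) of the LR(0) automaton."""
    states = [closure({("E'", ("E",), 0)})]   # C = { closure({S' -> .S}) }
    transitions = {}
    i = 0
    while i < len(states):                    # states grows while we scan it
        for x in SYMBOLS:
            target = goto(states[i], x)
            if not target:                    # empty goto: no transition
                continue
            if target not in states:
                states.append(target)
            transitions[(i, x)] = states.index(target)
        i += 1
    return states, transitions


states, transitions = canonical_collection()
print(len(states), "item sets")
```

The loop ends when a full pass over the list adds no new set, which corresponds to the "repeat until nothing can be added" condition.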

Example: Compute the canonical collection of LR(0) items for the following grammar.

E -> E+T | T
T -> T*F | F
F -> (E) | id

Solution: The augmented grammar is

E' -> E
E -> E+T | T
T -> T*F | F
F -> (E) | id


I0 = closure({E' -> .E}) = {E' -> .E, E -> .E+T, E -> .T, T -> .T*F, T -> .F, F -> .(E), F -> .id}

goto[I0,E] = closure({E' -> E., E -> E.+T}) = {E' -> E., E -> E.+T} = I1
goto[I0,T] = closure({E -> T., T -> T.*F}) = {E -> T., T -> T.*F} = I2
goto[I0,F] = closure({T -> F.}) = {T -> F.} = I3
goto[I0,(] = closure({F -> (.E)})
           = {F -> (.E), E -> .E+T, E -> .T, T -> .T*F, T -> .F, F -> .(E), F -> .id} = I4
goto[I0,id] = closure({F -> id.}) = {F -> id.} = I5
goto[I1,+] = closure({E -> E+.T})
           = {E -> E+.T, T -> .T*F, T -> .F, F -> .(E), F -> .id} = I6
goto[I2,*] = closure({T -> T*.F}) = {T -> T*.F, F -> .(E), F -> .id} = I7
goto[I4,E] = closure({F -> (E.), E -> E.+T}) = {F -> (E.), E -> E.+T} = I8
goto[I4,T] = closure({E -> T., T -> T.*F}) = {E -> T., T -> T.*F} = I2
goto[I4,F] = closure({T -> F.}) = {T -> F.} = I3
goto[I4,(] = closure({F -> (.E)}) = I4
goto[I4,id] = closure({F -> id.}) = I5
goto[I6,T] = closure({E -> E+T., T -> T.*F}) = {E -> E+T., T -> T.*F} = I9
goto[I6,F] = closure({T -> F.}) = I3
goto[I6,(] = closure({F -> (.E)}) = I4
goto[I6,id] = closure({F -> id.}) = I5
goto[I7,F] = closure({T -> T*F.}) = {T -> T*F.} = I10
goto[I7,(] = closure({F -> (.E)}) = I4
goto[I7,id] = closure({F -> id.}) = I5
goto[I8,)] = closure({F -> (E).}) = {F -> (E).} = I11
goto[I8,+] = closure({E -> E+.T}) = I6
goto[I9,*] = closure({T -> T*.F}) = {T -> T*.F, F -> .(E), F -> .id} = I7

The transition diagram of the goto function: DFA

Construction of SLR parsing Table

To construct the action and goto parts of an SLR parsing table, use the following algorithm.

1. Given the grammar G, construct an augmented grammar G' by introducing a production of
the form S' -> S, where S is the start symbol of the grammar G.

2. Let I0 = closure({S' -> .S}). Starting from I0, construct the canonical collection of sets of
LR(0) items for G' using closure and goto: C = {I0, ..., In}.

3. State i is constructed from Ii. The parsing actions for state i are defined as follows:

a. If [A -> α.aβ] is in Ii and goto[Ii,a] = Ij, set action[i,a] = "shift j". Here a must be a
terminal.

b. If Ii has an item of the form [A -> α.], then for every symbol a in FOLLOW(A), set
action[i,a] = "reduce A -> α". Here A must not be S'.

c. If [S' -> S.] is in Ii, then set action[i,$] = accept.

4. If goto[Ii,A] = Ij, where A is a non-terminal, then set goto[i,A] = j.

5. All entries not defined by steps 3 and 4 are set to error.

6. The start state s_0 corresponds to I0 = closure({S' -> .S}).

• If any conflicting actions are generated by the above rules, we say the grammar is not
SLR(1). The algorithm fails to produce a parser in this case.

• The parsing table constructed by this algorithm is the SLR(1) parsing table of G.

• An LR parser using this SLR(1) table is called an SLR(1) parser, or simply an SLR
parser.
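The table-construction steps can be sketched in Python for the expression grammar used in these notes. Several things here are assumptions of the sketch rather than part of the algorithm: items are pairs (production index, dot), the FOLLOW sets are written in by hand instead of being computed, and missing dictionary entries stand for "error".

```python
PRODS = [("E'", ("E",)), ("E", ("E", "+", "T")), ("E", ("T",)),
         ("T", ("T", "*", "F")), ("T", ("F",)),
         ("F", ("(", "E", ")")), ("F", ("id",))]
NONTERMS = {"E'", "E", "T", "F"}
# FOLLOW sets taken as given for this grammar (computed by hand).
FOLLOW = {"E": {"+", ")", "$"},
          "T": {"+", "*", ")", "$"},
          "F": {"+", "*", ")", "$"}}


def closure(items):
    result, work = set(items), list(items)
    while work:
        p, dot = work.pop()
        body = PRODS[p][1]
        if dot < len(body) and body[dot] in NONTERMS:
            for q, (head, _) in enumerate(PRODS):
                if head == body[dot] and (q, 0) not in result:
                    result.add((q, 0))
                    work.append((q, 0))
    return frozenset(result)


def goto(items, x):
    return closure({(p, d + 1) for p, d in items
                    if d < len(PRODS[p][1]) and PRODS[p][1][d] == x})


def build_slr_table():
    states = [closure({(0, 0)})]
    action, goto_tab = {}, {}

    def set_action(i, a, entry):
        # A second, different entry for the same cell is a conflict.
        if action.setdefault((i, a), entry) != entry:
            raise ValueError(f"conflict in state {i} on {a!r}: not SLR(1)")

    i = 0
    while i < len(states):
        for x in ["E", "T", "F", "(", "id", "+", "*", ")"]:
            target = goto(states[i], x)
            if not target:
                continue
            if target not in states:
                states.append(target)
            j = states.index(target)
            if x in NONTERMS:
                goto_tab[(i, x)] = j              # step 4
            else:
                set_action(i, x, ("shift", j))    # step 3a
        for p, dot in states[i]:
            head, body = PRODS[p]
            if dot == len(body):
                if p == 0:
                    set_action(i, "$", ("accept",))       # step 3c
                else:
                    for a in FOLLOW[head]:
                        set_action(i, a, ("reduce", p))   # step 3b
        i += 1
    return action, goto_tab


action, goto_tab = build_slr_table()
print(action[(0, "id")], action[(2, "*")], action[(2, "+")])
```

Raising an exception on a doubly defined entry mirrors the remark that the algorithm fails to produce a parser when the grammar is not SLR(1).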

Example: Construct the SLR parsing table for the following grammar.

E -> E+T | T
T -> T*F | F
F -> (E) | id

Solution: The augmented grammar G' is:

E' -> E
E -> E+T | T
T -> T*F | F
F -> (E) | id


The canonical collection of sets of LR(0) items for this grammar is:

I0:  E' -> .E, E -> .E+T, E -> .T, T -> .T*F, T -> .F, F -> .(E), F -> .id
I1:  E' -> E., E -> E.+T
I2:  E -> T., T -> T.*F
I3:  T -> F.
I4:  F -> (.E), E -> .E+T, E -> .T, T -> .T*F, T -> .F, F -> .(E), F -> .id
I5:  F -> id.
I6:  E -> E+.T, T -> .T*F, T -> .F, F -> .(E), F -> .id
I7:  T -> T*.F, F -> .(E), F -> .id
I8:  F -> (E.), E -> E.+T
I9:  E -> E+T., T -> T.*F
I10: T -> T*F.
I11: F -> (E).

Now we compute the entries of the action part of the SLR parsing table.

Consider the sets of items.

For I0:

The item F -> .(E) gives the entry action[0,(] = shift 4
The item F -> .id gives the entry action[0,id] = shift 5

The other items in I0 yield no action. Similarly,

For I1:

action[1,$] = accept, since E' -> E. is in I1
action[1,+] = shift 6

For I2:

action[2,*] = shift 7, since T -> T.*F is in I2
From E -> T., FOLLOW(E) = {$,+,)}, so action[2,$] = action[2,+] = action[2,)] = reduce E -> T

For I3:

T -> F., and FOLLOW(T) = {*,$,+,)}
So, action[3,*] = action[3,$] = action[3,+] = action[3,)] = reduce T -> F

For I4:

action[4,(] = shift 4
action[4,id] = shift 5

For I5:

F -> id., and FOLLOW(F) = {*,$,+,)}
So, action[5,*] = action[5,$] = action[5,+] = action[5,)] = reduce F -> id

Similarly,

For I6:

action[6,(] = shift 4
action[6,id] = shift 5

For I7:

action[7,(] = shift 4
action[7,id] = shift 5

For I8:

action[8,)] = shift 11
action[8,+] = shift 6

For I9:

E -> E+T., and FOLLOW(E) = {$,+,)}
So, action[9,$] = action[9,+] = action[9,)] = reduce E -> E+T, and
action[9,*] = shift 7

For I10:

T -> T*F., and FOLLOW(T) = {*,$,+,)}
So, action[10,*] = action[10,$] = action[10,+] = action[10,)] = reduce T -> T*F

For I11:

F -> (E)., and FOLLOW(F) = {*,$,+,)}
So, action[11,*] = action[11,$] = action[11,+] = action[11,)] = reduce F -> (E)

Now the action table for SLR parsing of the above grammar is:


State   id        +         *         (         )         $
0       shift 5                       shift 4
1                 shift 6                                 accept
2                 E -> T    shift 7             E -> T    E -> T
3                 T -> F    T -> F              T -> F    T -> F
4       shift 5                       shift 4
5                 F -> id   F -> id             F -> id   F -> id
6       shift 5                       shift 4
7       shift 5                       shift 4
8                 shift 6                       shift 11
9                 E -> E+T  shift 7             E -> E+T  E -> E+T
10                T -> T*F  T -> T*F            T -> T*F  T -> T*F
11                F -> (E)  F -> (E)            F -> (E)  F -> (E)

An entry that names a production means "reduce by that production"; blank entries are errors.


Now the goto table for SLR, filled in from the goto computations above (rows are states, columns
are non-terminals; states without entries are omitted):

State   E     T     F
0       1     2     3
4       8     2     3
6             9     3
7                   10

Using the above parsing table, the SLR parser makes the following moves to parse the string
id*id+id:

Stack               Input        Action               Output
0                   id*id+id$    shift 5
0 id 5              *id+id$      reduce by F -> id    F -> id
0 F 3               *id+id$      reduce by T -> F     T -> F
0 T 2               *id+id$      shift 7
0 T 2 * 7           id+id$       shift 5
0 T 2 * 7 id 5      +id$         reduce by F -> id    F -> id
0 T 2 * 7 F 10      +id$         reduce by T -> T*F   T -> T*F
0 T 2               +id$         reduce by E -> T     E -> T
0 E 1               +id$         shift 6
0 E 1 + 6           id$          shift 5
0 E 1 + 6 id 5      $            reduce by F -> id    F -> id
0 E 1 + 6 F 3       $            reduce by T -> F     T -> F
0 E 1 + 6 T 9       $            reduce by E -> E+T   E -> E+T
0 E 1               $            accept
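The moves in this trace can be reproduced with a small table-driven driver. The sketch below hard-codes the action and goto entries of the expression grammar; the production numbering and the helper name fill are our own assumptions. Unlike the description in the notes, the stack here holds states only (the grammar symbols are implied by the states), so a reduction pops r entries instead of 2r.

```python
# Productions numbered for the reduce entries: number -> (head, body length).
PRODS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3), 4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}
NAMES = {1: "E -> E+T", 2: "E -> T", 3: "T -> T*F", 4: "T -> F",
         5: "F -> (E)", 6: "F -> id"}

ACTION = {}


def fill(state, symbols, entry):
    for a in symbols:
        ACTION[(state, a)] = entry


for s in (0, 4, 6, 7):
    fill(s, ["id"], ("s", 5))
    fill(s, ["("], ("s", 4))
fill(1, ["+"], ("s", 6))
fill(1, ["$"], ("acc", 0))
fill(2, ["*"], ("s", 7))
fill(2, ["+", ")", "$"], ("r", 2))
fill(3, ["+", "*", ")", "$"], ("r", 4))
fill(5, ["+", "*", ")", "$"], ("r", 6))
fill(8, ["+"], ("s", 6))
fill(8, [")"], ("s", 11))
fill(9, ["*"], ("s", 7))
fill(9, ["+", ")", "$"], ("r", 1))
fill(10, ["+", "*", ")", "$"], ("r", 3))
fill(11, ["+", "*", ")", "$"], ("r", 5))

GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3,
        (4, "E"): 8, (4, "T"): 2, (4, "F"): 3,
        (6, "T"): 9, (6, "F"): 3, (7, "F"): 10}


def lr_parse(tokens):
    """Return the productions used, in order, or raise SyntaxError."""
    stack = [0]
    tokens = list(tokens) + ["$"]
    pos, output = 0, []
    while True:
        kind, n = ACTION.get((stack[-1], tokens[pos]), ("err", 0))
        if kind == "s":                       # shift: push state, consume token
            stack.append(n)
            pos += 1
        elif kind == "r":                     # reduce by production n
            head, length = PRODS[n]
            del stack[len(stack) - length:]   # pop one state per body symbol
            stack.append(GOTO[(stack[-1], head)])
            output.append(NAMES[n])
        elif kind == "acc":
            return output
        else:
            raise SyntaxError(f"unexpected {tokens[pos]!r} at position {pos}")


print(lr_parse(["id", "*", "id", "+", "id"]))
```

The returned list matches the Output column of the trace, read from top to bottom.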



Properties of SLR grammars 

Every SLR(1) grammar is unambiguous, but there exist unambiguous grammars that are not
SLR(1). For such a grammar there is at least one multiply defined entry action[i,a], for example
one that contains both a shift directive and a reduce directive.

Invalid reduction:

In an SLR parser, in any state i, a reduction by A -> α is performed on input symbol a if state i
contains [A -> α.] and a is in FOLLOW(A). However, in a given state not every symbol in
FOLLOW(A) can validly follow A, so some of these reductions are invalid.


For example, consider the following grammar:

S -> L = R | R
L -> * R | id
R -> L




Now, while constructing the parsing table, consider the item set I2:

goto[I2,=] = I6; this makes the entry action[2,=] = shift 6 in the SLR parsing table.

R -> L. is also in I2, so by the rule for constructing the SLR parsing table we compute
FOLLOW(R) = {=,$}; this makes the entry action[2,=] = reduce R -> L in the SLR parsing table.

So the entry action[2,=] is multiply defined: one entry is a shift operation and the other is a
reduce operation. This leads the SLR parser into a shift-reduce conflict in state 2 on input '='.

The grammar is not ambiguous, but there is a shift-reduce conflict. The SLR parser is not
powerful enough to remember enough left context to decide what action to take on input '='.

LR(k) items 

In order to avoid invalid reductions, the general form of an item becomes [A -> α.β, a], i.e.
extra information is put into a state by including a terminal symbol as a second component of an
item. The second component a has no effect when β is not empty, but for an item of the form
[A -> α., a] the reduction by A -> α is performed only if the next input symbol is a.

Such an item is called an LR(1) item, where the input symbol a is called the "lookahead", whose
length is 1.

The construction of the canonical collection of sets of LR(1) items is similar to the
construction of the canonical collection of sets of LR(0) items, except that the closure and goto
operations work slightly differently.

Computation of closure(I) for LR(1) items

1. Repeat

2. For each item of the form [A -> α.Bβ, a] in I,
   each production of the form B -> γ in G', and
   each terminal b in FIRST(βa), do
       add [B -> .γ, b] to I if it is not already there.

3. Until no more items can be added to I.

Computation of goto[I,X] for LR(1) items

Given the set of all items of the form [A -> α.Xβ, a] in I,
goto[I,X] = closure({[A -> αX.β, a]}).

Computation of LR(1) DFA

1. Start with C = {closure({[S' -> .S, $]})}, where S is the start symbol.

2. Repeat

3. For each set of items I in C and each grammar symbol X such that goto[I,X] is not empty
and not already in C, do

       add goto[I,X] to C.

4. Until no more sets of items can be added to C.

Example: Compute the LR(1) collection of items for the following grammar.
S -> CC
C -> cC | d

Solution: The augmented grammar is
S' -> S
S -> CC
C -> cC | d


I0: closure({[S' -> .S, $]})
    = { [S' -> .S, $]
        [S -> .CC, $]
        [C -> .cC, c/d]
        [C -> .d, c/d] }

I1: goto(I0, S) =
        [S' -> S., $]

I2: goto(I0, C) =
        [S -> C.C, $]
        [C -> .cC, $]
        [C -> .d, $]

I3: goto(I0, c) =
        [C -> c.C, c/d]
        [C -> .cC, c/d]
        [C -> .d, c/d]

I4: goto(I0, d) =
        [C -> d., c/d]

I5: goto(I2, C) =
        [S -> CC., $]

I6: goto(I2, c) =
        [C -> c.C, $]
        [C -> .cC, $]
        [C -> .d, $]

I7: goto(I2, d) =
        [C -> d., $]

I8: goto(I3, C) =
        [C -> cC., c/d]

    goto(I3, c) = I3
    goto(I3, d) = I4

I9: goto(I6, C) =
        [C -> cC., $]

    goto(I6, c) = I6
    goto(I6, d) = I7


Note: In the first case, [S' -> .S, $] is of the form [A -> α.Bβ, a] where β is empty and a = $.
The item [B -> .γ, b] has to be added for each terminal b in FIRST(βa), which here is {$}.
So we add [S -> .CC, $].

Similarly, for [S -> .CC, $] we have βa = C$ and FIRST(C$) = {c, d}, so the following items have
to be added:

[C -> .cC, c/d]
[C -> .d, c/d]

The Core of LR(1) Items

The core of a set of LR(1) items is the set of their first components (i.e., LR(0) items). For example,
the core of the set of LR(1) items
{ [C -> c.C, c/d],
  [C -> .cC, c/d],
  [C -> .d, c/d] }
is
{ C -> c.C,
  C -> .cC,
  C -> .d }
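The LR(1) closure, goto and collection computations can be sketched in Python for the grammar S -> CC, C -> cC | d. The sketch makes simplifying assumptions that are not part of the general algorithm: items are triples (production index, dot, lookahead), and FIRST is written in by hand, which is only valid because this grammar has no ε-productions.

```python
PRODS = [("S'", ("S",)), ("S", ("C", "C")), ("C", ("c", "C")), ("C", ("d",))]
NONTERMS = {"S'", "S", "C"}
# FIRST sets written by hand. With no epsilon productions, FIRST(beta a)
# depends only on the first symbol of the string beta a.
FIRST = {"S": {"c", "d"}, "C": {"c", "d"}}


def first_of(symbols):
    x = symbols[0]
    return FIRST[x] if x in NONTERMS else {x}


def closure(items):
    result, work = set(items), list(items)
    while work:
        p, dot, a = work.pop()
        body = PRODS[p][1]
        if dot < len(body) and body[dot] in NONTERMS:
            # Item [A -> alpha.B beta, a]: add [B -> .gamma, b], b in FIRST(beta a)
            for b in first_of(body[dot + 1:] + (a,)):
                for q, (head, _) in enumerate(PRODS):
                    if head == body[dot] and (q, 0, b) not in result:
                        result.add((q, 0, b))
                        work.append((q, 0, b))
    return frozenset(result)


def goto(items, x):
    return closure({(p, d + 1, a) for p, d, a in items
                    if d < len(PRODS[p][1]) and PRODS[p][1][d] == x})


def lr1_collection():
    states = [closure({(0, 0, "$")})]
    i = 0
    while i < len(states):
        for x in ["S", "C", "c", "d"]:
            target = goto(states[i], x)
            if target and target not in states:
                states.append(target)
        i += 1
    return states


states = lr1_collection()
print(len(states), "sets of LR(1) items")
```

With this symbol order the sets are discovered in the same order as I0 to I9 above; an item written with lookahead c/d in the notes corresponds to two triples here.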
Construction of the LR(1) parsing table

To construct the action and goto parts of an LR(1) parsing table, use the following algorithm.

1. Given the grammar G, construct an augmented grammar G' by introducing a production of
the form S' -> S, where S is the start symbol of the grammar G.

2. Let I0 = closure({[S' -> .S, $]}). Starting from I0, construct the canonical collection of sets
of LR(1) items for G' using closure and goto: C = {I0, ..., In}.

3. State i is constructed from Ii. The parsing actions for state i are defined as follows:

a. If [A -> α.aβ, b] is in Ii and goto[Ii,a] = Ij, set action[i,a] = "shift j". Here a must be a
terminal.

b. If Ii has an item of the form [A -> α., a], set action[i,a] = "reduce A -> α".
Here A must not be S'.

c. If [S' -> S., $] is in Ii, then set action[i,$] = accept.

4. If goto[Ii,A] = Ij, where A is a non-terminal, then set goto[i,A] = j.

5. All entries not defined by steps 3 and 4 are set to error.

6. The start state s_0 corresponds to I0 = closure({[S' -> .S, $]}).


Now the parsing table for the grammar given above, with the productions numbered as

0. S' -> S
1. S -> CC
2. C -> cC
3. C -> d

will be:

             action              goto
State   c     d     $        S     C
0       s3    s4             1     2
1                   accept
2       s6    s7                   5
3       s3    s4                   8
4       R3    R3
5                   R1
6       s6    s7                   9
7                   R3
8       R2    R2
9                   R2


Here R2 means reduce by production 2 of the grammar above, i.e. reduce by C -> cC, and so on.

si means shift i, i.e. s3 stands for "shift 3".

LALR Grammar


LALR (lookahead LR) grammars are midway in complexity between SLR and canonical
LR (most complex) grammars. LALR parsers perform a "generalization" over the canonical LR
item sets.

For a typical programming language, a canonical LR parser has thousands of states, whereas
SLR and LALR parsers have only hundreds. So SLR and LALR parsers are much easier and
more economical to construct.

Union operation on items.

• Given an item of the form [A -> α.Bβ, a], the first part of the item, A -> α.Bβ, is called
the "core" of the item.

• Given two states of the form Ii = {[A -> α., a]} and Ij = {[A -> α., b]}, the cores of Ii and Ij
are the same; the only difference is the second part of the item. So the union of these two
states is defined as

Iij = {[A -> α., a/b]}.

• The state Iij will perform the reduce operation on seeing either a or b in the input buffer. The
state machine resulting from the above union operation has one state fewer.

• If Ii and Ij have more than one item, then the set of all core elements in Ii must be the
same as the set of all core elements in Ij for the union operation to be possible.

• The union operation does not create any new shift-reduce conflicts but can create new
reduce-reduce conflicts. The shift operation depends only on the core and not on the next input
symbol.

Construction of the LALR parsing table

To construct the action and goto parts of an LALR parsing table, use the following algorithm.

1. Given the grammar G, construct an augmented grammar G' by introducing a production of
the form S' -> S, where S is the start symbol of the grammar G.

2. Construct the canonical collection C = {I0, ..., In} of sets of LR(1) items for G'. For each core
present, find all sets having that same core and replace them with a single set which is their
union: C = {I0, ..., In} becomes C' = {J1, ..., Jm}, where m ≤ n.

3. Create the parsing tables (action and goto tables) in the same way as for the LR(1) parser.

If J = I1 ∪ ... ∪ Ik, then since I1, ..., Ik have the same cores, the cores of
goto(I1,X), ..., goto(Ik,X) must also be the same.

4. So goto(J,X) = K, where K is the union of all sets of items having the same core as goto(I1,X).

If no conflict is introduced, the grammar is an LALR(1) grammar. The above algorithm is
inefficient since it constructs the entire canonical LR(1) DFA before generating the LALR
parsing table.
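The union step can be sketched in Python by grouping LR(1) item sets by their core. The data below is the LR(1) collection I0 to I9 for S -> CC, C -> cC | d written out by hand; representing an item as a pair (LR(0) item string, lookahead) and the name merge_by_core are assumptions of this sketch.

```python
# LR(1) item sets I0..I9 for S -> CC, C -> cC | d; each item is (core item, lookahead).
LR1_STATES = {
    0: {("S' -> .S", "$"), ("S -> .CC", "$"), ("C -> .cC", "c"),
        ("C -> .cC", "d"), ("C -> .d", "c"), ("C -> .d", "d")},
    1: {("S' -> S.", "$")},
    2: {("S -> C.C", "$"), ("C -> .cC", "$"), ("C -> .d", "$")},
    3: {("C -> c.C", "c"), ("C -> c.C", "d"), ("C -> .cC", "c"),
        ("C -> .cC", "d"), ("C -> .d", "c"), ("C -> .d", "d")},
    4: {("C -> d.", "c"), ("C -> d.", "d")},
    5: {("S -> CC.", "$")},
    6: {("C -> c.C", "$"), ("C -> .cC", "$"), ("C -> .d", "$")},
    7: {("C -> d.", "$")},
    8: {("C -> cC.", "c"), ("C -> cC.", "d")},
    9: {("C -> cC.", "$")},
}


def merge_by_core(states):
    """Union all LR(1) states that share the same core (items without lookaheads)."""
    groups = {}
    for number, items in states.items():
        core = frozenset(item for item, _ in items)
        members, merged = groups.setdefault(core, ([], set()))
        members.append(number)
        merged.update(items)
    return {tuple(members): merged for members, merged in groups.values()}


lalr_states = merge_by_core(LR1_STATES)
for members, items in lalr_states.items():
    print(members, sorted(items))
```

Each key of the result lists the original states that were merged, so (3, 6) plays the role of I36 in the example that follows.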

Example:

Compute the LALR collection of items and the LALR parsing table for the following grammar.

S -> CC
C -> cC | d

Solution: The augmented grammar is
S' -> S
S -> CC
C -> cC | d

The canonical LR(1) item sets computed from this grammar are {I0, I1, ..., I9}, as computed for the
previous LR(1) parsing table.

In the collection of LR(1) items, I3 and I6, I4 and I7, and I8 and I9 have the same cores,
respectively. So after performing the union operations, the item sets for LALR are:


I0: closure({[S' -> .S, $]})
    = { [S' -> .S, $]
        [S -> .CC, $]
        [C -> .cC, c/d]
        [C -> .d, c/d] }

I1: goto(I0, S) =
        [S' -> S., $]

I2: goto(I0, C) =
        [S -> C.C, $]
        [C -> .cC, $]
        [C -> .d, $]

I36: goto(I0, c) =
        [C -> c.C, c/d/$]
        [C -> .cC, c/d/$]
        [C -> .d, c/d/$]

I47: goto(I0, d) =
        [C -> d., c/d/$]

I5: goto(I2, C) =
        [S -> CC., $]

I89: goto(I36, C) =
        [C -> cC., c/d/$]


Now,

goto[I36,C] = I89, since goto[I3,C] = I8 in the original set of LR(1) items. Similarly,
goto[I2,c] = I36, since goto[I2,c] = I6 in the original LR(1) items, and so on.


So the LALR parsing table for the grammar has three fewer states (productions numbered
1. S -> CC, 2. C -> cC, 3. C -> d):

             action              goto
State   c     d     $        S     C
0       s36   s47            1     2
1                   accept
2       s36   s47                  5
36      s36   s47                  89
47      R3    R3    R3
5                   R1
89      R2    R2    R2

Now for the input ccd, the LR(1) parser takes the following steps:

Stack           Input    Action    Output
$0              ccd$     shift 3
$0c3            cd$      shift 3
$0c3c3          d$       shift 4
$0c3c3d4        $        error

The LALR parser, using the above table, takes the following steps:

Stack           Input    Action             Output
$0              ccd$     shift 36
$0c36           cd$      shift 36
$0c36c36        d$       shift 47
$0c36c36d47     $        reduce C -> d      C -> d
$0c36c36C89     $        reduce C -> cC     C -> cC
$0c36C89        $        reduce C -> cC     C -> cC
$0C2            $        error

So both the LR(1) and the LALR parser end in an error for the string ccd.
Now consider the string cdd.

LR(1) parser:

Stack           Input    Action             Output
$0              cdd$     shift 3
$0c3            dd$      shift 4
$0c3d4          d$       reduce C -> d      C -> d
$0c3C8          d$       reduce C -> cC     C -> cC
$0C2            d$       shift 7
$0C2d7          $        reduce C -> d      C -> d
$0C2C5          $        reduce S -> CC     S -> CC
$0S1            $        accept

LALR parser:

Stack           Input    Action             Output
$0              cdd$     shift 36
$0c36           dd$      shift 47
$0c36d47        d$       reduce C -> d      C -> d
$0c36C89        d$       reduce C -> cC     C -> cC
$0C2            d$       shift 47
$0C2d47         $        reduce C -> d      C -> d
$0C2C5          $        reduce S -> CC     S -> CC
$0S1            $        accept


For a valid string of the grammar, both the LR(1) and the LALR parser take the same actions. For
an invalid string like ccd, the LR(1) parser reports the error right after the shift actions, whereas
the LALR parser still performs some reductions after the point where the LR(1) parser has
detected the error. In the end, the LALR parser also discovers the error.

Kernel and non-kernel items 

In order to devise a more efficient way of building LALR parsing tables, we define kernel and
non-kernel items. Kernel items are the initial item [S' -> .S, $] and all items whose dot is not at the
beginning of the right side. No item generated by goto has its dot at the left end of the production.
Items that are generated by closure over kernel items have the dot at the beginning of the
production. These items are called non-kernel items.

Chapter 5: Syntax Directed Translation

Grammar symbols are associated with attributes that carry information about the language
constructs they represent. An attribute may hold almost anything: a string, a number, a memory
location, a complex record.

The values of these attributes are evaluated by the semantic rules associated with the productions of the
grammar. The evaluation of these semantic rules may generate intermediate code, put information into the
symbol table, perform type checking, issue error messages, etc. When we associate semantic rules with
productions, we use two notations:

- Syntax-Directed Definitions

- Translation Schemes

Syntax-directed definitions: Syntax-directed definitions are high-level specifications for translations.
They hide many implementation details, such as the order in which translation takes place. We associate a
production rule with a set of semantic actions, and we do not say when they will be evaluated.

Translation schemes: Translation schemes indicate the order in which semantic rules are to be
evaluated. In other words, translation schemes expose some information about implementation
details.


A syntax-directed definition is a generalization of a context-free grammar in which: 

- Each grammar symbol is associated with a set of attributes. 

- This set of attributes for a grammar symbol is partitioned into two subsets called 

• synthesized and 

• inherited attributes of that grammar symbol. 

- Each production rule is associated with a set of semantic rules. 

- Semantic rules set up dependencies between attributes which can be represented by a 
dependency graph. 

This dependency graph determines the evaluation order of these semantic rules. Evaluation of a semantic 
rule defines the value of an attribute. But a semantic rule may also have some side effects such as printing a 
value. 

The value of synthesized attributes of a node are determined by the children of the node where as the value 
of inherited attributes of node are determined by the parent and siblings of the node. To determine the 
attributes of the nodes in parse tree, we annotate the parse tree. 


Annotated Parse Tree: 


• A parse tree showing the values of attributes at each node is called an annotated parse tree. 

• The process of computing the attributes values at the nodes is called annotating (or decorating) of 
the parse tree. 

• Of course, the order of these computations depends on the dependency graph induced by the 
semantic rules. 

The conceptual view of syntax-directed translation is: input string, then parse tree, then dependency
graph, then evaluation order for the semantic rules.

In a syntax-directed definition, each grammar production A -> α has associated with it a set of semantic rules
of the form b = f(c_1, c_2, ..., c_n), where f is a function and b can be one of the following:

1. b is a synthesized attribute of A and c_1, c_2, ..., c_n are attributes of the grammar symbols of the
production (A -> α).

2. b is an inherited attribute of one of the grammar symbols on the right side of the production and
c_1, c_2, ..., c_n are attributes of the grammar symbols in the production (A -> α).

Attribute Grammar: 

• So, a semantic rule b = f(c_1, c_2, ..., c_n) indicates that the attribute b depends on attributes c_1, c_2, ..., c_n.

• In a syntax-directed definition, a semantic rule may just evaluate a value of an attribute or it may 
have some side effects such as printing values. 

• An attribute grammar is a syntax-directed definition in which the functions in the semantic rules 
cannot have side effects (they can only evaluate values of attributes). 


Functions in semantic rules will often be written as expressions. Below is an example of a syntax-directed
definition:


Production            Semantic Rules

L -> E return         print(E.val)
E -> E1 + T           E.val = E1.val + T.val
E -> T                E.val = T.val
T -> T1 * F           T.val = T1.val * F.val
T -> F                T.val = F.val
F -> ( E )            F.val = E.val
F -> digit            F.val = digit.lexval


Here, the symbols E, T, and F are associated with a synthesized attribute val.

The token digit has a synthesized attribute lexval (it is assumed that it is evaluated by the lexical
analyzer).

In a syntax-directed definition, terminals are assumed to have synthesized attributes only, and these are
usually supplied by the lexical analyzer.

S-attributed definition:

A syntax-directed definition that uses synthesized attributes exclusively is said to be an S-attributed
definition. A parse tree for an S-attributed definition can always be annotated by evaluating the semantic
rules for the attributes at each node in a bottom-up manner. The evaluation of an S-attributed definition
is based on a depth-first traversal of the parse tree.
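Bottom-up evaluation of the synthesized attribute val can be sketched in Python as a post-order walk over a parse tree. The tree encoding (nested tuples whose first element names the production) and the RULES table are assumptions made for this illustration; the rule for L -> E return is left out, so the sketch only computes E.val.

```python
# One semantic rule per production. Each rule receives the attribute values of
# the body symbols, in order (None for tokens that carry no value).
RULES = {
    "E -> E + T": lambda e, _, t: e + t,
    "E -> T": lambda t: t,
    "T -> T * F": lambda t, _, f: t * f,
    "T -> F": lambda f: f,
    "F -> ( E )": lambda _l, e, _r: e,
    "F -> digit": lambda lexval: lexval,
}


def evaluate(node):
    """Post-order walk: evaluate the children, then apply the node's rule."""
    if not isinstance(node, tuple):
        # Leaf: an int stands for digit.lexval; operator tokens have no value.
        return node if isinstance(node, int) else None
    production, *children = node
    return RULES[production](*(evaluate(child) for child in children))


# Parse tree for 5+3*4
tree = ("E -> E + T",
        ("E -> T", ("T -> F", ("F -> digit", 5))),
        "+",
        ("T -> T * F",
         ("T -> F", ("F -> digit", 3)),
         "*",
         ("F -> digit", 4)))

print(evaluate(tree))  # 17
```

Because every rule only uses values of the children, a single depth-first pass is enough; no dependency graph has to be built.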
The annotated parse tree for the input expression 5+3*4 return is:

[Figure: annotated parse tree for 5+3*4 return. The root L has children E.val=17 and return;
E.val=17 has children E.val=5, + and T.val=12; E.val=5 derives T.val=5, F.val=5 and digit.lexval=5;
T.val=12 has children T.val=3, * and F.val=4, with leaves digit.lexval=3 and digit.lexval=4.]

Figure: The depth-first traversal of the annotated parse tree


Production            Semantic Rules

E -> E1 + T           E.loc = newtemp(), E.code = E1.code || T.code || add E1.loc,T.loc,E.loc
E -> T                E.loc = T.loc, E.code = T.code
T -> T1 * F           T.loc = newtemp(), T.code = T1.code || F.code || mult T1.loc,F.loc,T.loc
T -> F                T.loc = F.loc, T.code = F.code
F -> ( E )            F.loc = E.loc, F.code = E.code
F -> id               F.loc = id.name, F.code = ""


• Symbols E, T, and F are associated with synthesized attributes loc and code.

• The token id has a synthesized attribute name (it is assumed that it is evaluated by the
lexical analyzer).

• It is assumed that || is the string concatenation operator.

Construction of syntax Tree: 


Syntax-directed definitions can be used to specify the construction of syntax trees and other graphical
representations. The syntax tree is an intermediate representation of the expression.

A syntax tree is a condensed form of parse tree useful for representing language constructs. The production
S -> if B then S1 else S2 might appear in a syntax tree as

[Figure: node if-then-else with children B, S1 and S2]

In a syntax tree, operators and keywords do not appear as leaves; instead they are associated with the
interior node that would be the parent of those leaves in the parse tree. The syntax tree for the
expression 5+3*4 is:

[Figure: syntax tree with + at the root; its children are 5 and *; the children of * are 3 and 4]

Syntax-directed translation can be based on syntax trees as well as on parse trees. The approach is
the same in both cases. We attach attributes to the nodes as in a parse tree. To construct the syntax tree
for a language construct, the syntax-directed definition must create the nodes,
assign attributes and link each node to other nodes.

Construction of the syntax tree is similar to the translation of the expression into postfix form. A
sub-tree is constructed for each sub-expression by creating a node for each operator and operand.
The children of an operator node are the roots of the sub-trees for its operands.

Each node in a syntax tree is a record with several fields. If a node is an operator, it has
pointer fields that point to its children (operands). So there are two types of nodes: interior nodes
(operator nodes) and leaf nodes.

Consider the following grammar with associated semantic rules:

Production            Semantic Rules

E -> E1 + T           E.val = E1.val + T.val
E -> E1 - T           E.val = E1.val - T.val
E -> T                E.val = T.val
T -> ( E )            T.val = E.val
T -> id               T.val = id.entry
T -> num              T.val = num.val

The syntax-directed definition that creates the syntax tree for an expression relies on
the following functions.

1. makenode(op, left, right): creates an operator node with label op and two fields
containing the pointers left and right.

2. makeleaf(id, entry): creates an identifier node with label id and a field containing entry, a
pointer to the symbol table entry for that id.

3. makeleaf(num, val): creates a node with label num and a field containing val, the value of
the number.

The syntax-directed definition for the construction of the syntax tree using these functions is:


Production            Semantic Rules

E -> E1 + T           E.ptr = makenode('+', E1.ptr, T.ptr)
E -> E1 - T           E.ptr = makenode('-', E1.ptr, T.ptr)
E -> T                E.ptr = T.ptr
T -> ( E )            T.ptr = E.ptr
T -> id               T.ptr = makeleaf(id, id.entry)
T -> num              T.ptr = makeleaf(num, num.val)


So for the expression 3+5-4, executing the following sequence of calls creates the syntax
tree:


1. p1 = makeleaf(num, 3);

2. p2 = makeleaf(num, 5);

3. p3 = makeleaf(num, 4);

4. p4 = makenode('+', p1, p2);

5. p5 = makenode('-', p4, p3);
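The same sequence of calls can be written as a small Python sketch. The Node record and the to_postfix helper are our own illustrative choices; the notes only specify what makenode and makeleaf must do.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Node:
    label: str                               # operator, "id" or "num"
    left: Optional["Node"] = None            # child pointers (operator nodes only)
    right: Optional["Node"] = None
    value: Union[int, str, None] = None      # num.val or id.entry (leaves only)


def makenode(op, left, right):
    """Create an operator node with two child pointers."""
    return Node(op, left, right)


def makeleaf(label, value):
    """Create a leaf for an identifier (value = entry) or a number (value = val)."""
    return Node(label, value=value)


def to_postfix(node):
    """Post-order walk; shows that the tree mirrors the postfix form."""
    if node.left is None and node.right is None:
        return str(node.value)
    return f"{to_postfix(node.left)} {to_postfix(node.right)} {node.label}"


# Syntax tree for 3+5-4
p1 = makeleaf("num", 3)
p2 = makeleaf("num", 5)
p3 = makeleaf("num", 4)
p4 = makenode("+", p1, p2)
p5 = makenode("-", p4, p3)

print(to_postfix(p5))  # 3 5 + 4 -
```

The root p5 is the '-' node, its left child is the '+' node for 3+5 and its right child is the leaf 4, which reflects the left associativity of the grammar.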

The syntax tree created above is:

[Figure: syntax tree with - at the root; its left child is +, with children num 3 and num 5, and its
right child is num 4]

Directed Acyclic graphs for expression:

A directed acyclic graph (DAG) for an expression identifies the common sub-expressions in the
expression. Like a syntax tree, a DAG has a node for every sub-expression of the expression. An
interior node represents an operator and its children represent the operands. The only difference
between a syntax tree and a DAG is that in the DAG a node representing a common sub-expression
has more than one parent, whereas in the syntax tree the common sub-expression is represented by
a duplicated sub-tree.

For example, consider the following expression:

a + a * (b - c) + (b - c) * d 

The syntax tree and the DAG are as shown below:

[Figure: syntax tree (left) and directed acyclic graph (right) for the expression; in the DAG the
nodes for a and for b - c are shared]

The sequence of instructions for creating the DAG of the above expression using makenode and
makeleaf is as below:


1. p1 = makeleaf(id, a);

2. p2 = makeleaf(id, b);

3. p3 = makeleaf(id, c);

4. p4 = makeleaf(id, d);

5. p5 = makenode('-', p2, p3);

6. p6 = makenode('*', p1, p5);

7. p7 = makenode('+', p1, p6);

8. p8 = makenode('*', p5, p4);

9. p9 = makenode('+', p7, p8);
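One common way to obtain a DAG instead of a tree is to have makenode and makeleaf return an existing node whenever an identical one has already been created. The Python sketch below illustrates that idea; the DagBuilder class and its fields are assumptions of this sketch, not something the notes prescribe.

```python
class DagBuilder:
    """Creates nodes on demand and reuses identical ones (common sub-expressions)."""

    def __init__(self):
        self.nodes = []   # node i is (label, value) for leaves or (op, left, right)
        self.index = {}   # node tuple -> node number

    def _intern(self, key):
        if key not in self.index:
            self.index[key] = len(self.nodes)
            self.nodes.append(key)
        return self.index[key]

    def makeleaf(self, label, value):
        return self._intern((label, value))

    def makenode(self, op, left, right):
        return self._intern((op, left, right))


# a + a * (b - c) + (b - c) * d, built one sub-expression occurrence at a time
dag = DagBuilder()
a1 = dag.makeleaf("id", "a")
a2 = dag.makeleaf("id", "a")                 # same node as a1
b_minus_c_1 = dag.makenode("-", dag.makeleaf("id", "b"), dag.makeleaf("id", "c"))
product1 = dag.makenode("*", a2, b_minus_c_1)
sum1 = dag.makenode("+", a1, product1)
b_minus_c_2 = dag.makenode("-", dag.makeleaf("id", "b"), dag.makeleaf("id", "c"))
product2 = dag.makenode("*", b_minus_c_2, dag.makeleaf("id", "d"))
root = dag.makenode("+", sum1, product2)

print(len(dag.nodes))  # 9
```

The builder ends up with nine nodes, matching p1 to p9 above, whereas a syntax tree for the same expression would need thirteen nodes because a and b - c would each be duplicated.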


Inherited Attributes: 

An inherited attribute at a node is defined in terms of the attributes at the parent and/or siblings of
that node. Inherited attributes are useful for describing the context-sensitive behavior of grammar
symbols. For example, an inherited attribute can be used to keep track of whether an identifier
appears on the left or right side of an assignment operator. This can be used to decide whether to
use the value of the identifier or its l-value.

Example of inherited attributes:

The following example shows the distribution of type information to the various identifiers in a
declaration. A declaration generated by the non-terminal D in the syntax-directed definition consists of
the keyword int or real followed by a list of identifiers L.

The non-terminal T has a synthesized attribute type whose value is determined by the keyword in the
declaration.

The syntax-directed definition for the declaration is:

Production            Semantic Rules

D -> T L              L.in = T.type
T -> int              T.type = integer
T -> real             T.type = real
L -> L1 , id          L1.in = L.in
                      addtype(id.entry, L.in)
L -> id               addtype(id.entry, L.in)


The semantic rule L.in = T.type, associated with the production D -> T L, sets the inherited attribute
L.in to the type in the declaration. The rules then pass this type down the parse tree using the inherited
attribute L.in.

The rules associated with the productions for L call the procedure addtype() to add the type of each
identifier to its entry in the symbol table (pointed to by the attribute entry).

The annotated parse tree for the input string int id1, id2, id3 is shown below.

[Figure: annotated parse tree for int id1, id2, id3. The root D has children T (T.type = int) and L (L.in = int); each L node carries L.in = int, with id3, id2 and id1 as the identifiers from top to bottom.]


• The above annotated parse tree shows the inherited attribute L.in at each node labeled L.

• At the first level of the tree, L.in is inherited from the sibling attribute T.type, where int is the
value of the synthesized attribute T.type.

• At the lower levels, the nodes labeled L inherit L.in from the parent
node (top-down).

• At each L node we also call the procedure addtype() to record in the symbol table the fact
that the identifier at the right child of this L node has type int.
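
The flow of L.in down the tree can be sketched in Python. This is only an illustration under assumptions: the function names declare and decl_list and the dictionary used as a symbol table are hypothetical; the recursion mirrors the productions L -> L1 , id and L -> id.

```python
def declare(keyword, ids, symtab):
    """D -> T L: T synthesizes a type and L inherits it as L.in."""
    t_type = {"int": "integer", "real": "real"}[keyword]  # T.type
    decl_list(ids, t_type, symtab)                        # L.in = T.type


def decl_list(ids, l_in, symtab):
    """L -> L1 , id | id, where ids is the list of identifier names under L."""
    if len(ids) > 1:
        decl_list(ids[:-1], l_in, symtab)  # L1.in = L.in (passed down unchanged)
    symtab[ids[-1]] = l_in                 # addtype(id.entry, L.in)


symtab = {}
declare("int", ["id1", "id2", "id3"], symtab)
```

After the call, every identifier in the list is recorded with the type integer, which is the effect of the addtype() calls in the annotated parse tree.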


The dependency graph: 

A dependency graph is a useful tool for evaluating the attributes of syntax tree nodes in the correct
order. If an attribute b of a node in a parse tree depends on an attribute c, then the semantic rule
for b at that node must be evaluated after the semantic rule that defines c. A dependency
graph is a directed graph that contains attributes as nodes and dependencies between attributes as
edges.

For example

E -> E1 + E2    { E.val = E1.val + E2.val }

[Figure: the parse tree (left) and the dependency graph (right), with edges from E1.val and E2.val to E.val]



Algorithm for dependency graph

For each node n in the parse tree do begin
    For each attribute a of the grammar symbol at n do begin
        Construct a node in the dependency graph for a
    End
End

For each node n in the parse tree do begin
    For each semantic rule of the form b = f(c1, c2, c3, ..., ck) associated with the
    production at n do begin
        For i = 1 to k do begin
            Construct an edge from ci to b
        End
    End
End

e.g. Suppose A.a = f(X.x, Y.y) is a semantic rule for the production A -> X Y which defines a
synthesized attribute A.a that depends on the attributes X.x and Y.y. For the parse tree of this
production, there will be 3 nodes A.a, X.x and Y.y in the dependency graph, with an edge to A.a from
both X.x and Y.y.

[Figure: dependency graph for A.a = f(X.x, Y.y)]

If the production A -> X Y has the semantic rule X.x = g(A.a, Y.y), then there will be edges from A.a
and Y.y to X.x, since X.x depends on Y.y and A.a.

[Figure: dependency graph for X.x = g(A.a, Y.y)]

For the declaration grammar, the dependency graph for the string

int id1, id2, id3

is shown below.

[Figure: dependency graph for int id1, id2, id3. Dotted lines show the parse tree; directed solid lines show the dependencies, hence the dependency graph.]

The dependencies in the above graph are:

- L.in = T.type and L1.in = L.in.

- The semantic rule addtype(id.entry, L.in), associated with the L-productions, leads to the
creation of a dummy attribute at each L node (d1, d2, d3 in the graph).

Exercise: A grammar for declarations is given as

D -> id L
L -> , id L | : T
T -> integer | real

Construct a syntax directed definition to enter the type of each identifier into the symbol table using
synthesized and inherited attributes. Also construct the annotated parse tree for the input string

id1, id2, id3 : integer

and construct the dependency graph for the same.




L-attributed definitions:

A syntax directed definition is called L-attributed if, in the semantic rules of each production
A -> X1 X2 X3 ... Xn, every inherited attribute of Xj (1 <= j <= n) depends only on:

- The attributes (synthesized or inherited) of the symbols X1, X2, ..., Xj-1 to the left of Xj.

- The inherited attributes of A.

The attributes of an L-attributed definition can be evaluated in a left-to-right fashion using a depth-first evaluation order.

Example of an L-attributed definition:

Production

A -> X Y Z    { X.i = f1(A.i), Y.i = f2(X.s) }

X -> Y Z      { Z.i = f(X.i, Y.i) }

Example of a syntax directed definition which is not L-attributed:

A -> L M    { L.i = f1(A.i), M.i = f2(L.s), A.s = f3(M.s) }

A -> Q R    { R.i = f4(A.i), Q.i = f5(R.s), A.s = f6(Q.s) }

Here the rules for A -> L M satisfy the conditions, but the semantic rule Q.i = f5(R.s) violates the
rule for L-attributed definitions, since the inherited attribute Q.i depends on the synthesized
attribute of R, which is to the right of Q in the production A -> Q R. The rules A.s = f3(M.s) and
A.s = f6(Q.s) are not violations: A.s is a synthesized attribute, and the restriction applies only to
inherited attributes.

Translation Schemes: A CFG with "semantic actions" embedded into its productions is a
translation scheme. It is useful for binding the order of evaluation into the parse tree.

For example, a translation scheme for translating simple infix expressions involving + and - into postfix form:

expr -> expr + term    { print("+") }
expr -> expr - term    { print("-") }
expr -> term
term -> 0              { print("0") }
term -> 1              { print("1") }
...
term -> 9              { print("9") }




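The parse tree for the expression 9-5+2:

[Figure: parse tree for 9-5+2 with the actions {print("9")}, {print("5")}, {print("-")}, {print("2")}, {print("+")} attached as extra leaves]

A depth-first traversal of this tree executes the actions in order and prints the equivalent postfix expression: 95-2+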
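
The translation scheme can be tried out with a small Python sketch. It is an illustration only: the function name to_postfix is an assumption, and the left-recursive productions for expr are handled with a loop (the usual way to parse them top-down) while the print actions are kept in the same relative positions.

```python
def to_postfix(text):
    """Translate single digits joined by + and - into postfix notation."""
    out = []
    pos = 0

    def term():
        nonlocal pos
        if pos >= len(text) or not text[pos].isdigit():
            raise SyntaxError(f"digit expected at position {pos}")
        out.append(text[pos])        # term -> digit {print(digit)}
        pos += 1

    term()                           # expr -> term
    while pos < len(text) and text[pos] in "+-":
        op = text[pos]
        pos += 1
        term()
        out.append(op)               # expr -> expr op term {print(op)}
    if pos != len(text):
        raise SyntaxError(f"unexpected character at position {pos}")
    return "".join(out)


result = to_postfix("9-5+2")
```

Because each operator is emitted only after both of its operands have been emitted, the output for 9-5+2 is the postfix string 95-2+.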

Writing the Translation Schemes

Start with the syntax directed definition. Make sure that we never refer to an attribute that has not
been defined already.

For S-attributed definitions, we can construct a translation scheme simply by creating an action
consisting of an assignment for each semantic rule and placing it in { ... } at the rightmost end of each production.

For example:

Production        Semantic Rules

S -> A1 A2        S.s = A1.s + A2.s

A -> a            A.s = 1

The translation scheme is:

S -> A1 A2 { S.s = A1.s + A2.s }

A -> a { A.s = 1 }

If both synthesized and inherited attributes are involved:

- An inherited attribute for a symbol on the RHS of a production must be computed in an
action before that symbol.

- An action must not refer to a synthesized attribute of a symbol that is to the right of the action.

- A synthesized attribute of the non-terminal on the LHS can only be computed after all attributes it
references have been computed. (The action for such an attribute is placed at the rightmost
end of the production.)

For example,

1. Incorrect

S -> A1 A2 { A1.in = 1; A2.in = 2 }
A -> a { print(A.in) }

2. Correct

S -> { A1.in = 1; A2.in = 2 } A1 A2
A -> a { print(A.in) }

3. Correct

S -> { A1.in = 1 } A1 { A2.in = 2 } A2
A -> a { print(A.in) }

Top down Translation: 

Inherited attributes can be evaluated in a top-down fashion using a transformation similar to the technique used
for elimination of left recursion. Consider the left-recursive grammar with the translation scheme

A -> A1 Y    { A.a = g(A1.a, Y.y) }    (here A and A1 are the same symbol)

A -> X       { A.a = f(X.x) }

Here all attributes are synthesized. Eliminating the left recursion from the above
grammar gives the equivalent grammar

A -> X R
R -> Y R | ε

Carrying the semantic actions over into the transformed grammar:

A -> X { R.i = f(X.x) } R { A.a = R.s }

R -> Y { R1.i = g(R.i, Y.y) } R1 { R.s = R1.s }

R -> ε { R.s = R.i }

To see why the results of the left-recursive and the non-left-recursive schemes are the same, consider the
following two annotated parse trees for the input X Y Y.

[Figure 1: annotated parse tree for the left-recursive scheme, with A.a = g(g(f(X.x), Y.y), Y.y) at the root]

[Figure 2: annotated parse tree for the transformed scheme, with R.i = g(g(f(X.x), Y.y), Y.y) at the bottom R node and R.s = g(g(f(X.x), Y.y), Y.y) copied upward]

A.a computed according to the two translation schemes is the same.

In figure 1, A.a is computed bottom-up. In figure 2, R.i is computed top-down, and A.a is
obtained from the R.i at the bottom, which is passed up unchanged as R.s.
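
The equivalence of the two schemes can be checked with a short Python sketch. The function names and the representation of the input (one value for X and a list of values for the Y symbols) are assumptions made for illustration; f and g are passed in as ordinary functions.

```python
def eval_left_recursive(x, ys, f, g):
    """A -> A1 Y {A.a = g(A1.a, Y.y)} | X {A.a = f(X.x)}, synthesized only."""
    if not ys:
        return f(x)                                   # A -> X
    return g(eval_left_recursive(x, ys[:-1], f, g), ys[-1])


def eval_transformed(x, ys, f, g):
    """A -> X {R.i = f(X.x)} R {A.a = R.s} with R -> Y R1 | epsilon."""
    def r(r_i, rest):
        if not rest:
            return r_i                                # R -> epsilon {R.s = R.i}
        r1_i = g(r_i, rest[0])                        # R1.i = g(R.i, Y.y)
        return r(r1_i, rest[1:])                      # R.s = R1.s

    return r(f(x), ys)                                # A.a = R.s


identity = lambda x: x
subtract = lambda a, y: a - y     # non-commutative, so evaluation order matters
left_value = eval_left_recursive(9, [5, 2], identity, subtract)
right_value = eval_transformed(9, [5, 2], identity, subtract)
```

With f as the identity and g as subtraction, both versions compute (9 - 5) - 2, so the left-associative meaning is preserved by the transformed grammar.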

Bottom up evaluation of L-attributed definitions 

While evaluating S-attributed definitions during bottom-up parsing is straightforward, evaluating
inherited attributes is a bit tricky. L-attributed definitions are a subclass of definitions with
inherited attributes that can be implemented effectively in most bottom-up parsers.

Consider the type declaration of variables in a language like C, described by the following rules.

D -> T { L.type = T.val } L
L -> L , id    { settype(id.entry, L.type) }
L -> id        { settype(id.entry, L.type) }
T -> int       { T.val = int }
T -> float     { T.val = float }

Given a string of the form float id,id, a bottom-up evaluation can be traced as


STACK                        INPUT BUFFER     PRODUCTION USED    ACTION
$                            float id,id$                        shift
$ float                      id,id$           T -> float         reduce
$ T {T.val = float}          id,id$                              shift
$ T id                       ,id$             L -> id            reduce
$ T L {L.type = T.val}       ,id$                                shift
$ T L ,                      id$                                 shift
$ T L , id                   $                L -> L , id        reduce
$ T L                        $                D -> T L           reduce

The moves of a parser for the given string of the grammar, with the inherited attributes on the
parser stack.

The parser stack above can be implemented as a pair of parallel stacks, one for states and one for
values. The entry state[i] holds a grammar symbol X and val[i] holds the attribute of the symbol in
state[i]. Every time the right side of a production for
L is reduced in the above example, T is on the stack just below the right side. We can use this fact to
access the attribute value T.val when evaluating the attributes.

Let top and ntop be the indices of the top entry of the stack just before and just after a reduction takes
place. The copy rule L.type = T.val places T.val in L.type. So when L -> id
is applied for reduction, val[top] = val[top-1]. Similarly, when L -> L , id is applied for reduction,
there are 3 symbols on the right side of the production, so the 3 symbols on top of the state stack are
removed and the new symbol L is pushed; at the same time the attribute for it in the value stack is
inherited as val[top] = val[top-3]. So the parser should execute some code fragment to obtain
the attribute of a symbol when a reduction is applied. The following code fragments are executed
when the corresponding reduction is used.

Production          Code fragment
D -> T L
L -> L , id         val[top] = val[top-3]
L -> id             val[top] = val[top-1]
T -> int            val[ntop] = integer
T -> float          val[ntop] = float


Evaluation order of the attributes: 

In syntax directed translation, the order of evaluation of the attributes is determined using the
dependency graph. Once the dependency graph of the parse tree for a language construct has been
created for the given input string, any topological ordering of the nodes of the dependency graph is a
valid order of evaluation of the attributes. If b = f(x, y, z) is the semantic rule for a production of a
grammar, where attribute b depends on attributes x, y and z, then x, y and z must be evaluated before
b.
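
A topological order of a dependency graph can be computed with the standard library of Python 3.9 or later. The sketch below is illustrative: the function name evaluation_order and the attribute names used as dictionary keys are assumptions. Each key is an attribute b and its value lists the attributes c1, ..., ck that b depends on.

```python
from graphlib import TopologicalSorter


def evaluation_order(rules):
    """Return the attributes in an order that respects every dependency."""
    # TopologicalSorter expects {node: predecessors}; predecessors come first.
    return list(TopologicalSorter(rules).static_order())


# Dependencies for "int id1, id2": L.in of the upper L node comes from T.type,
# L.in of the lower L node comes from the upper one.
rules = {
    "L_upper.in": ["T.type"],
    "L_lower.in": ["L_upper.in"],
    "E.val": ["E1.val", "E2.val"],
}
order = evaluation_order(rules)
```

If the dependency graph contains a cycle, static_order() raises graphlib.CycleError, which corresponds to a definition whose attributes cannot be evaluated in any order.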
Chapter 6: Type Checking

[Figure: Position of Type Checker, between the token stream and the intermediate representation]

A compiler has to perform semantic checks in addition to syntax analysis. Type checking is one of the
static semantic checks, but some systems also use dynamic type checking.


Static checking: the compiler enforces the programming language's static semantics, i.e. the
program properties that can be checked at compile time.

Static checking examples:

• Type checks.

o Report an error if an operator is applied to an incompatible operand, e.g.

int op(int), op(float);
int f(float);
int a, c[10], d;
d = c + d;         // FAIL: type mismatch
*d = a;            // FAIL: not a pointer type
a = op(d);         // OK: overloading (C++)
a = f(d);          // OK: coercion of d to float
vector<int> v;     // OK: template instantiation


• Flow-of-control checks.

o Statements that cause the flow of control to leave a construct must have some
place to which to transfer the flow of control, e.g.

myfunc(int a)
{ cout << a;
  break;            // ERROR: misplaced break
}

myfunc()
{ ...
  switch (a)
  { case 0:
      statement
      break;        // OK
    case 1:
      ...
  }
}

myfunc(int a)
{
  while (a)
  { ....
    if (i > 10)
      break;        // OK
  }
}


• Uniqueness checks.

o There are situations where an object, such as a label or an identifier, must be defined
exactly once, e.g.

myfunc()
{ int i, j, i;      // ERROR

• Name-related checks.

o Sometimes the same name must appear two or more times, e.g. at the
beginning and end of a construct.

Dynamic semantics: checked at run time

• The compiler generates verification code to enforce the programming language's
dynamic semantics.

A type system is a collection of rules for assigning type expressions to the parts of a program. A 
type checker implements a type system. A sound type system eliminates run-time type checking 
for type errors. 

A programming language is strongly typed if its compiler can guarantee that there will be no 
type errors during run-time. 

Type Expression: 

The type of a language construct is denoted by a "type expression". A type expression is defined
as follows:

• A basic type is a type expression, e.g. integer, char, real etc.

• A type name is a type expression, e.g. if the type int is given the name x, then x is a type
expression.

• A type constructor applied to type expressions is a type expression. The constructors
include array, product, pointer, function and record, e.g.

- If T is a type expression, then array[I] of T, written array(I, T), is a type expression denoting
an array with index set I and elements of type T.

- If T1 and T2 are type expressions, then their Cartesian product T1 x T2 is a type expression
(product).

- If T is a type expression, then pointer(T) is a type expression.

- A function in a programming language is a mapping from a domain type D to a range type R,
e.g. int x int -> pointer(char) denotes a function that takes a pair of integers and
returns a pointer to char.

- A record is a structured type.

Specification of a simple type checker: The simple type checker is specified by a translation
scheme that saves the type information for each identifier. Following is the translation scheme for
declarations.

P -> D ; E
D -> D ; D
D -> id : T                       { addtype(id.entry, T.val) }
T -> char                         { T.val = char }
T -> int                          { T.val = int }
T -> real                         { T.val = real }
T -> * T1                         { T.val = pointer(T1.val) }
T -> array [ intnum ] of T1       { T.val = array(1..intnum.val, T1.val) }

The above translation scheme saves the type of an identifier declared
in a language like Pascal, e.g. a : integer saves the type of id 'a' as 'integer'.
Type checking of expressions:

The following example shows the type checking of expressions in a translation scheme.

E -> id             { E.type = lookup(id.entry) }
E -> literal        { E.type = char }
E -> intliteral     { E.type = int }
E -> realliteral    { E.type = real }
E -> E1 + E2        { if (E1.type = int and E2.type = int) then E.type = int
                      else if (E1.type = int and E2.type = real) then E.type = real
                      else if (E1.type = real and E2.type = int) then E.type = real
                      else if (E1.type = real and E2.type = real) then E.type = real
                      else E.type = type-error }
E -> E1 [ E2 ]      { if (E2.type = int and E1.type = array(s, t)) then E.type = t
                      else E.type = type-error }
E -> * E1           { if (E1.type = pointer(t)) then E.type = t
                      else E.type = type-error }
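
The rules above can be sketched as a small recursive type checker in Python. Everything about the representation is an assumption made for illustration: expressions are nested tuples, constructed types are tuples such as ("array", index_range, element_type) and ("pointer", t), the symbol table is a dictionary, and an undeclared identifier is treated as a type error.

```python
def check(expr, symtab):
    """Return E.type for an expression tree, or "type-error"."""
    kind = expr[0]
    if kind == "id":
        return symtab.get(expr[1], "type-error")     # lookup(id.entry)
    if kind == "literal":
        return "char"
    if kind == "intliteral":
        return "int"
    if kind == "realliteral":
        return "real"
    if kind == "+":
        left, right = check(expr[1], symtab), check(expr[2], symtab)
        if left == "int" and right == "int":
            return "int"
        if left in ("int", "real") and right in ("int", "real"):
            return "real"
        return "type-error"
    if kind == "index":                              # E1 [ E2 ]
        arr, idx = check(expr[1], symtab), check(expr[2], symtab)
        if idx == "int" and isinstance(arr, tuple) and arr[0] == "array":
            return arr[2]
        return "type-error"
    if kind == "deref":                              # * E1
        ptr = check(expr[1], symtab)
        if isinstance(ptr, tuple) and ptr[0] == "pointer":
            return ptr[1]
        return "type-error"
    raise ValueError(f"unknown expression kind: {kind}")


symtab = {"a": "int", "x": "real",
          "v": ("array", (1, 10), "real"), "p": ("pointer", "int")}
mixed = check(("+", ("id", "a"), ("id", "x")), symtab)
element = check(("index", ("id", "v"), ("intliteral", 3)), symtab)
bad = check(("+", ("id", "a"), ("literal", "c")), symtab)
```

The mixed int and real addition yields real, indexing the array yields its element type, and adding an int to a char literal yields type-error, as in the translation scheme.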


Another example: type checking specification for a simple language:

E -> true         { E.type = boolean }
E -> false        { E.type = boolean }
E -> literal      { E.type = char }
E -> num          { E.type = integer }
E -> id           { E.type = lookup(id.entry) }
E -> E1 + E2      { E.type := if E1.type = integer and E2.type = integer
                    then integer else type_error }
E -> E1 and E2    { E.type := if E1.type = boolean and E2.type = boolean
                    then boolean else type_error }


Type checking expressions for another grammar: Example

T -> int                       { T.type = int }
T -> char                      { T.type = char }
T -> real                      { T.type = real }
T -> array [ intnum, T1 ]      { T.type = array(1..intnum.val, T1.type) }
T -> Pointer ( T1 )            { T.type = pointer(T1.type) }

Type checking expression for an array of pointers to real, where the array index ranges from 1 to 100:

T -> real                { T.type = real }
E -> array [ 100, T ]    { if T.type = real then T.type = array(1..100, T) else type_error() }
E -> Pointer [ E1 ]      { if (E1.type = array[100, real]) then E.type = E1.type else E.type = type-error }


Type checking of statements.

Assignment statement:

S -> id = E            { if (id.type = E.type) then S.type = void else S.type = type-error }

If-then statement:

S -> if E then S1      { if (E.type = boolean) then S.type = S1.type else S.type = type-error }

While statement:

S -> while E do S1     { if (E.type = boolean) then S.type = S1.type else S.type = type-error }

Type checking of function calls

E -> E1 ( E2 )    { if (E2.type = s and E1.type = s -> t) then E.type = t
                    else E.type = type-error }

Ex: int f(double x, char y) { ... }

f : double x char -> int

Here double x char is the product of the argument types and int is the return type.

A function whose domain is a pair of characters and whose range is a pointer to integer:

T -> int               { T.type = int }
T -> char              { T.type = char }
T -> Pointer [ T1 ]    { T.type = Pointer(T1.type) }
E -> E1 ( E2 )         { if E2.type = (char, char) and E1.type = (char, char) -> Pointer(int)
                         then E.type = E1.type else type_error }

Example: consider the following grammar for arithmetic expressions that apply an operator 'op' to
integer or real numbers

E -> E1 op E2 | num.num | num | id

Give the syntax directed definition as a translation scheme to determine the type of an expression:
when two integers are used in an expression the resulting type is integer, otherwise it is real.

The translation scheme will be:

E -> id            { E.type = lookup(id.entry) }
E -> num           { E.type = integer }
E -> num.num       { E.type = real }
E -> E1 op E2      { if (E1.type = integer and E2.type = integer) then E.type = integer
                     else if (E1.type = integer and E2.type = real) then E.type = real
                     else if (E1.type = real and E2.type = integer) then E.type = real
                     else if (E1.type = real and E2.type = real) then E.type = real
                     else E.type = type-error }

Type Conversion and Coercion 

Type conversion is explicit, for example using type casts. Consider the expression a + b where a is
of type integer and b is of type real. Since the representations of integers and reals in the computer
are different, and different machine instructions are used for operations on integers and reals, the
compiler must convert one of the operands of + to the type of the other so that both
operands are of the same type when the addition takes place.

The conversion of the type of one operand to another can be requested explicitly using cast operators,
which the type checker must handle.

Type coercion is performed implicitly by the compiler, which generates code that converts the types of
values at runtime (typically to narrow or widen a type).

Both require a type system to check and infer types from (sub)expressions.

Equivalence of Type expression: 

As long as type expressions are built from basic types and constructors, a notion of equivalence 
between two type expressions is structural equivalence. Two type expressions are structurally 
equivalent iff they are identical. E.g. pointer(Integer) is equivalent to pointer(Integer). 

The algorithm for testing structural equivalence can be adapted to test other notions of equivalence. If two
type expressions are equivalent, the operation is performed; otherwise a type conversion is applied
if possible, or a type error is reported. For example, if two operands are of int and real type, then the
operation can be performed after type checking and conversion, but if an array and a record are
combined as a + r, then there will be a type error.

Following is an example of an algorithm that can be adopted to check the equivalence of
type expressions.

boolean sequival (s, t)
{
    if s and t are the same basic type
        return true;
    else if s = array(s1, s2) and t = array(t1, t2) then
        return sequival(s1, t1) and sequival(s2, t2)
    else if s = s1 x s2 and t = t1 x t2 then
        return sequival(s1, t1) and sequival(s2, t2)
    else if s = pointer(s1) and t = pointer(t1) then
        return sequival(s1, t1)
    else if s = s1 -> s2 and t = t1 -> t2 then
        return sequival(s1, t1) and sequival(s2, t2)
    else return false
}

If the array bounds s1 and t1 in s = array(s1, s2) and t = array(t1, t2) are to be ignored, the test for array
equivalence above can be re-formulated as:

    if s = array(s1, s2) and t = array(t1, t2) then
        return sequival(s2, t2)
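
A runnable version of the structural equivalence test can be sketched in Python. The representation is an assumption: basic types are strings, and constructed types are tuples whose first element names the constructor, e.g. ("array", "1..10", "int"), ("product", t1, t2), ("pointer", t) and ("function", domain, range).

```python
def sequival(s, t):
    """Return True if type expressions s and t are structurally equivalent."""
    if isinstance(s, str) or isinstance(t, str):
        # Basic types (and index ranges written as strings) must be identical.
        return s == t
    if s[0] != t[0] or len(s) != len(t):
        return False                  # different constructors
    # Same constructor: compare the component type expressions pairwise.
    return all(sequival(a, b) for a, b in zip(s[1:], t[1:]))


same_pointer = sequival(("pointer", "int"), ("pointer", "int"))
different_bounds = sequival(("array", "1..10", "int"), ("array", "1..20", "int"))
```

Two pointer(int) types compare as equivalent, while arrays with different index ranges do not, which is the behavior of the first formulation of the algorithm (bounds are not ignored).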
Intermediate Code Generation

The intermediate code generator converts the given program in a source language into an equivalent
program in an intermediate language. Intermediate codes are machine
independent codes. The intermediate representations are:

• Graphical representations, e.g. Abstract Syntax Trees (AST), DAGs

• Postfix notation

• Three-address code

Some languages have well defined intermediate codes, e.g. Java has the bytecode of the JVM.

Let us take an example of a syntax directed definition for assignment statements that builds the
intermediate representation (AST).


Production         Semantic Rule
S -> id = E        S.nptr = mknode(':=', mkleaf(id, id.entry), E.nptr)
E -> E1 + E2       E.nptr = mknode('+', E1.nptr, E2.nptr)
E -> E1 * E2       E.nptr = mknode('*', E1.nptr, E2.nptr)
E -> - E1          E.nptr = mknode('uminus', E1.nptr)
E -> ( E1 )        E.nptr = E1.nptr
E -> id            E.nptr = mkleaf(id, id.entry)

[Figure: the abstract syntax tree for the expression a*(b+c)]

Abstract syntax tree vs DAG for the statement a = b * -c + b * -c:

[Figure: abstract syntax tree and DAG for a = b * -c + b * -c]

Three Address Code: 

Three-address code is a sequence of statements of the general form x = y op z, where x, y and z are names,
constants or compiler-generated temporaries and op stands for an operator. So a source language
expression like x+y*z might be translated into the sequence

t1 = y * z
t2 = x + t1

where t1 and t2 are compiler-generated temporaries.

Similarly, x = y + z * w should be represented as

t1 = z * w
t2 = y + t1
x = t2

So the code for a = b * -c + b * -c should be:

For AST:             For DAG:
t1 = -c              t1 = -c
t2 = b * t1          t2 = b * t1
t3 = -c              t3 = t2 + t2
t4 = b * t3          a = t3
t5 = t2 + t4
a = t5
In fact, three-address code is a linearization of the tree. The following are the three-address code
forms for different statements.

Assignment statement:    x = y op z     (binary operator op)
                         x = op y       (unary operator op)

Copy statement:          x = z          (also called copy assignment)

Unconditional jump:      goto L         (jump to label L)

Conditional jump:        if x relop y goto L

Procedure call:
                         param x1
                         param x2
                         ...
                         param xn
                         call p, n      i.e. call procedure p(x1, x2, ..., xn)

Indexed assignment:
                         x = y[i]
                         x[i] = y

Address and pointer assignments:
                         x = &y
                         x = *y
                         *x = y

When three-address code is generated, temporary names are made up for the interior nodes of a
syntax tree. For the production E -> E1 + E2, the value of the non-terminal E is computed into a new
temporary t.

Syntax Directed Translation into 3-address codes 

• First deal with assignments.

• Use the attributes

o E.place : the name that will hold the value of E
o E.code : the sequence of three-address statements that evaluates E

• Use a function newtemp() that returns a new temporary variable that we can use.

• Use a function, say gen(), to generate a single three-address statement given the necessary
information, e.g. gen(E.place '=' E1.place '+' E2.place)

o => t3 = t1 + t2

For example (|| denotes concatenation of code):

S -> id = E      { S.code = E.code || gen(id.place '=' E.place) }

E -> E1 + E2     { E.place = newtemp();
                   E.code = E1.code || E2.code || gen(E.place '=' E1.place '+' E2.place) }

E -> E1 * E2     { E.place = newtemp();
                   E.code = E1.code || E2.code || gen(E.place '=' E1.place '*' E2.place) }

E -> - E1        { E.place = newtemp();
                   E.code = E1.code || gen(E.place '=' 'uminus' E1.place) }

E -> ( E1 )      { E.place = E1.place;
                   E.code = E1.code }

E -> id          { E.place = id.entry;
                   E.code = '' }

E -> num         { E.place = newtemp();
                   E.code = gen(E.place '=' num.value) }


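Code Generation:

The call gen(E.place '=' E1.place '+' E2.place) produces one three-address statement; with
E.place = t3, E1.place = t1 and E2.place = t2 it generates

t3 = t1 + t2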
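
The use of newtemp() and the place attribute can be sketched in Python. This is an illustrative sketch under assumptions: the class name TacGen and the tuple representation of expression trees are not from the notes, and identifiers and numbers are used directly as places without allocating a temporary.

```python
import itertools


class TacGen:
    """Generates three-address code for expression trees given as tuples."""

    def __init__(self):
        self.code = []
        self._counter = itertools.count(1)

    def newtemp(self):
        return f"t{next(self._counter)}"

    def expr(self, node):
        """Append the code for node and return its place."""
        if isinstance(node, str):          # E -> id (name used directly)
            return node
        if node[0] == "uminus":            # E -> - E1
            operand = self.expr(node[1])
            place = self.newtemp()
            self.code.append(f"{place} = -{operand}")
            return place
        op, left, right = node             # E -> E1 op E2
        left_place = self.expr(left)
        right_place = self.expr(right)
        place = self.newtemp()
        self.code.append(f"{place} = {left_place} {op} {right_place}")
        return place

    def assign(self, name, node):          # S -> id = E
        self.code.append(f"{name} = {self.expr(node)}")


gen = TacGen()
product = ("*", "b", ("uminus", "c"))
gen.assign("a", ("+", product, product))
```

For a = b * -c + b * -c this produces the six statements of the AST version, because every interior node of the tree receives its own temporary.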


Implementation of Three address codes: 

A three-address statement is an abstract form of intermediate code. In a compiler such statements
can be implemented as records with fields for the operator and the operands.
The following are the representations of three-address statements.

1. Quadruples:

• A record structure with four fields: op, arg1, arg2, result.

• The three-address statement x = y op z is represented by placing y in arg1, z in arg2
and x in result.

• Statements with unary operators like x = -y or x = y do not use arg2.

• Operators like param use neither arg2 nor result.

• Conditional and unconditional jumps put the target label in result.

e.g. for the statement a = b * -c + b * -c, the quadruples are:


pos    op        arg1    arg2    result
(0)    uminus    c               t1
(1)    *         b       t1      t2
(2)    uminus    c               t3
(3)    *         b       t3      t4
(4)    +         t2      t4      t5
(5)    =         t5              a


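2. Triples:

• A record structure with three fields: op, arg1, arg2. There is no result field; a temporary value is
referred to by the position of the triple that computes it.

e.g. for the same statement a = b * -c + b * -c, the triples are:

pos    op        arg1    arg2
(0)    uminus    c
(1)    *         b       (0)
(2)    uminus    c
(3)    *         b       (2)
(4)    +         (1)     (3)
(5)    =         a       (4)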


3. Indirect triples:

• Indirect triples list pointers to triples rather than the triples
themselves, e.g. the indirect triples for the same statement above are

Statement list:          Triples:
pos    op                pos     op        arg1    arg2
(0)    (10)              (10)    uminus    c
(1)    (11)              (11)    *         b       (10)
(2)    (12)              (12)    uminus    c
(3)    (13)              (13)    *         b       (12)
(4)    (14)              (14)    +         (11)    (13)
(5)    (15)              (15)    =         a       (14)
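
The relation between quadruples and triples can be sketched in Python. The function name to_triples and the tuple layout are assumptions for illustration; the sketch also assumes that every temporary is assigned exactly once, so that a temporary name can be replaced by the position of the triple that computes it.

```python
def to_triples(quads):
    """Convert (op, arg1, arg2, result) quadruples to (op, arg1, arg2) triples."""
    position = {}   # temporary name -> position of the triple computing it
    triples = []
    for op, arg1, arg2, result in quads:
        a1 = position.get(arg1, arg1)   # replace a temporary by its position
        a2 = position.get(arg2, arg2)
        if op == "=":
            # Copy into a program variable: target name first, then the value.
            triples.append((op, result, a1))
        else:
            position[result] = len(triples)
            triples.append((op, a1, a2))
    return triples


quads = [
    ("uminus", "c", None, "t1"),
    ("*", "b", "t1", "t2"),
    ("uminus", "c", None, "t3"),
    ("*", "b", "t3", "t4"),
    ("+", "t2", "t4", "t5"),
    ("=", "t5", None, "a"),
]
triples = to_triples(quads)
```

In the result, integer arguments are positions of earlier triples and string arguments are names, so the triple for + refers to positions 1 and 3 instead of t2 and t4.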


Assignments and Symbol Table 

We assume that names/addresses stand for pointers to their symbol table entries, since other information
is needed for final code generation.

• Under this assumption, temporary names must also be entered into the symbol table as they
are created by the newtemp function.

• The lexeme for the name id is given by id.entry, and the function lookup(id.entry) returns null
if no entry is found; otherwise a pointer to the entry is returned.

• Instead of using the code attribute, let a procedure emit() write three-address code to the
output file.

o This is always possible if the code of the non-terminal on the left is obtained by
concatenating the code attributes of the non-terminals on the right in the same
order, e.g. E -> E1 op E2 { E.place = newtemp();
                           emit(E.place '=' E1.place 'op' E2.place) }

Using this approach, the translation scheme to produce the three-address code for assignments can
be written as:

S -> id = E      { p = lookup(id.entry);
                   if p != null then emit(p '=' E.place) else error }

E -> E1 + E2     { E.place = newtemp();
                   emit(E.place '=' E1.place '+' E2.place) }

E -> E1 * E2     { E.place = newtemp();
                   emit(E.place '=' E1.place '*' E2.place) }

E -> - E1        { E.place = newtemp();
                   emit(E.place '=' 'uminus' E1.place) }

E -> ( E1 )      { E.place = E1.place }

E -> id          { p = lookup(id.entry);
                   if p != null then E.place = p else error }


Boolean Expression: 

Boolean expressions are used either to compute logical values or as conditional expressions in
flow-of-control statements.

• We consider Boolean expressions with the following grammar

1. E -> E or E | E and E | not E | (E) | id relop id | true | false

• There are two methods to evaluate Boolean expressions

1. Numerical representation: encode true as 1 and false as 0 and proceed
analogously to arithmetic expressions.

2. Jumping code: represent the value of a Boolean expression by a position
reached in the program.

Numerical:

- Expressions are evaluated from left to right, assuming that or and and are left
associative and that or has the lowest precedence, then and, then not.

e.g. the translation for a or (b and (not c)) is:

t1 = not c
t2 = b and t1
t3 = a or t2

e.g. 2: A relational expression such as a<b is equivalent to the conditional statement if a<b then
1 else 0. Its translation involves jumps to labeled statements.

100: if a<b goto 103
101: t = 0
102: goto 104
103: t = 1
104:

The following S-attributed definition makes use of a global variable nextstat that gives the index
of the next three-address statement and is incremented by emit. We use the attribute op to
determine which of the comparison operators is represented by relop.

E -> E1 or E2         { E.place = newtemp(); emit(E.place '=' E1.place 'or' E2.place) }

E -> E1 and E2        { E.place = newtemp(); emit(E.place '=' E1.place 'and' E2.place) }

E -> not E1           { E.place = newtemp(); emit(E.place '=' 'not' E1.place) }

E -> ( E1 )           { E.place = E1.place }

E -> id1 relop id2    { E.place = newtemp();
                        emit('if' id1.place relop.op id2.place 'goto' nextstat + 3);
                        emit(E.place '=' '0');
                        emit('goto' nextstat + 2);
                        emit(E.place '=' '1') }
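
The role of nextstat in the rule for id1 relop id2 can be sketched in Python. The class name Emitter and the function name relop are assumptions for illustration; emit appends one numbered statement and increments nextstat, as described in the text.

```python
class Emitter:
    """Collects numbered three-address statements; nextstat is the next index."""

    def __init__(self, start=100):
        self.nextstat = start
        self.code = []

    def emit(self, text):
        self.code.append(f"{self.nextstat}: {text}")
        self.nextstat += 1


def relop(em, id1, op, id2, place):
    """E -> id1 relop id2 with the numerical representation of booleans."""
    em.emit(f"if {id1} {op} {id2} goto {em.nextstat + 3}")
    em.emit(f"{place} = 0")
    em.emit(f"goto {em.nextstat + 2}")
    em.emit(f"{place} = 1")


em = Emitter(start=100)
relop(em, "a", "<", "b", "t")
```

Starting at statement 100, the generated code is the same four statements as in the a<b example above, and nextstat ends at 104, the statement that follows the expression.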

Jumping code for Boolean Expression 

The value of a Boolean expression is represented by a position in the code.

- Consider the example a<b above.

1. We can tell what value t will have by whether we reach statement 101 or 103.

- Jumping code is extremely useful when Boolean expressions appear in the context of
flow-of-control statements.

- Consider the flow-of-control statements generated by the following grammar.

S -> if E then S1 | if E then S1 else S2 | while E do S1

Flow of control statements: 

- In the translation, we assume that a three-address statement can have a symbolic
label and that the function newlabel generates such labels.

- We associate with E two labels using inherited attributes:

1. E.true - the label to which control flows if E is true.

2. E.false - the label to which control flows if E is false.

- We associate with S the inherited attribute S.next, which represents the label attached to the
first statement after the code for S.

The following figure shows how a flow-of-control statement is translated.

if-then:

            E.code       (jumps to E.true or to E.false)
E.true:     S1.code
E.false:    ...

Procedure Calls: A procedure call can be represented in three-address code as below. Let us
consider the following grammar.

S -> call id ( Elist )

Elist -> Elist , E | E

The procedure call statement call fun(a+1, b, 7) should be represented as:

t1 = a + 1
t2 = 7
param t1
param b
param t2
call fun, 3
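
The translation of a call can be sketched in Python. The function name gen_call and the argument representation are assumptions for illustration: a string is a plain name, a tuple (op, left, right) is a one-operator expression, and anything else is a constant. All arguments are evaluated first, then the param statements are emitted, then the call.

```python
def gen_call(proc, args):
    """Return three-address code for: call proc(args...)."""
    code, places, count = [], [], 0
    for arg in args:
        if isinstance(arg, str):            # a plain name needs no temporary
            places.append(arg)
            continue
        count += 1
        temp = f"t{count}"
        if isinstance(arg, tuple):          # (op, left, right)
            op, left, right = arg
            code.append(f"{temp} = {left} {op} {right}")
        else:                               # constant
            code.append(f"{temp} = {arg}")
        places.append(temp)
    code += [f"param {place}" for place in places]
    code.append(f"call {proc}, {len(args)}")
    return code


call_code = gen_call("fun", [("+", "a", 1), "b", 7])
```

For call fun(a+1, b, 7) this yields the two temporaries, the three param statements in argument order, and a final call with the argument count 3.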





Code Generation & Optimization 

This chapter describes how target code is generated, as efficiently as possible, from an intermediate form of the program.

The code produced by a compiler must be correct and of high quality. The source-to-target program transformation
should be semantics preserving and should make effective use of target machine resources. Heuristic techniques
are used to generate good but suboptimal code, because generating optimal code is undecidable.

Code Generator Design Issues 

The details of code generation depend on the target language and the operating system. Issues such as
memory management, instruction selection, register allocation and evaluation order arise in almost all
code-generation problems.

The issues in code generator design include:

Input to the code generator: The input to the code generator is the intermediate representation together with
the information in the symbol table. The form of the input may be postfix, three-address code, a DAG or a tree.

Target Program: The output of the code generator may be absolute machine code (executable code),
relocatable machine code (object files for the linker), assembly language (facilitates debugging) or byte code
for interpreters (e.g. JVM).

Target Machine: Implementing code generation requires a thorough understanding of the target machine
architecture and its instruction set.

Instruction Selection: Efficient and low-cost instruction selection is important to obtain efficient code.

Register Allocation: Proper utilization of registers improves code efficiency.

Choice of Evaluation Order: The order of computation affects the efficiency of the target code.

The Target Machine 

Consider a hypothetical target computer that is a byte-addressable machine (word = 4 bytes) with n
general-purpose registers, R0, R1, ..., Rn-1. It has two-address instructions of the form:

op source, destination 

It has the following op-codes:

MOV (move content of source to destination)

ADD (add content of source to destination)

SUB (subtract content of source from destination)

MUL (multiply destination by content of source)

The source and destination of an instruction are specified by combining registers and memory locations with
addressing modes. The addressing modes, together with their assembly forms and associated costs, are:


Addressing modes:

Mode                 Form      Address                        Added Cost
Absolute             M         M                              1
Register             R         R                              0
Indexed              c(R)      c + contents(R)                1
Indirect register    *R        contents(R)                    0
Indirect indexed     *c(R)     contents(c + contents(R))      1
Literal              #c        N/A                            1


Instruction Costs 


• The machine is a simple, non-superscalar processor with fixed instruction costs

• Realistic machines have deep pipelines, I-cache, D-cache, etc.

• Define the cost of an instruction as

      cost = 1 + cost(source-mode) + cost(destination-mode)


Instruction           Operation                                               Cost
MOV R0,R1             Store contents(R0) into register R1                     1
MOV R0,M              Store contents(R0) into memory location M               2
MOV M,R0              Store contents(M) into register R0                      2
MOV 4(R0),M           Store contents(4+contents(R0)) into M                   3
MOV *4(R0),M          Store contents(contents(4+contents(R0))) into M         3
MOV #1,R0             Store 1 into R0                                         2
ADD 4(R0),*12(R1)     Add contents(4+contents(R0)) to the value at
                      location contents(12+contents(R1))                      3


Instruction Selection 

Instruction selection is important to obtain efficient code. Suppose we translate the three-address code

    x := y + z

to:

    MOV y,R0
    ADD z,R0
    MOV R0,x

Applying the same general scheme to a := a + 1 gives the first sequence below; two better
translations follow.

General way of translation:

    MOV a,R0
    ADD #1,R0
    MOV R0,a        Cost = 6

Better:

    ADD #1,a        Cost = 3

Best:

    INC a           Cost = 2

Picking the shortest sequence of instructions is often a good approximation
of the optimal result.
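
The cost figures above can be checked mechanically. The following Python sketch applies the rule cost = 1 + cost(source-mode) + cost(destination-mode) to instruction strings. The function names and the regular expressions used to recognise each assembly form are assumptions made for this illustration; any operand that is not a register, literal or indexed form is treated as an absolute memory name.

```python
import re

# Added cost of each addressing mode, copied from the addressing-mode table.
MODE_COST = {
    "absolute": 1, "register": 0, "indexed": 1,
    "indirect register": 0, "indirect indexed": 1, "literal": 1,
}

def mode_of(operand):
    """Classify an operand by its assembly form (M, R, c(R), *R, *c(R), #c)."""
    if operand.startswith("#"):
        return "literal"
    if re.fullmatch(r"R\d+", operand):
        return "register"
    if re.fullmatch(r"\*R\d+", operand):
        return "indirect register"
    if re.fullmatch(r"\*\d+\(R\d+\)", operand):
        return "indirect indexed"
    if re.fullmatch(r"\d+\(R\d+\)", operand):
        return "indexed"
    return "absolute"  # any other name is treated as a memory location

def cost(instruction):
    """1 plus the added cost of every operand's addressing mode."""
    _, operands = instruction.split(None, 1)
    return 1 + sum(MODE_COST[mode_of(o.strip())] for o in operands.split(","))

def sequence_cost(instructions):
    return sum(cost(i) for i in instructions)

# The three translations of a := a + 1 discussed above.
general = ["MOV a,R0", "ADD #1,R0", "MOV R0,a"]
better = ["ADD #1,a"]
best = ["INC a"]
costs = [sequence_cost(s) for s in (general, better, best)]
```

Here costs evaluates to [6, 3, 2], which agrees with the figures given for the general, better and best translations.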
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Register Allocation and Assignment 

Accessing values in registers is much faster than accessing
main memory. Register allocation denotes the selection of
which variables will go into registers. Register assignment
is the determination of exactly which register holds a
given variable. The goal of these operations is generally to
minimize the total number of memory accesses required
by the program.

Finding an optimal register assignment is, in general, NP-complete.


Example: the three-address code

    t := a * b
    t := t + a
    t := t / d

translated with { R1 = t }:

    MOV a,R1
    MUL b,R1
    ADD a,R1
    DIV d,R1
    MOV R1,t

and translated with { R0 = a, R1 = t }:

    MOV a,R0
    MOV R0,R1
    MUL b,R1
    ADD R0,R1
    DIV d,R1
    MOV R1,t


Choice of Evaluation Order 


When instructions are independent, their evaluation order
can be changed.

Original order:

    t1 := a + b
    t2 := c + d
    t3 := e * t2
    t4 := t1 - t3

    MOV a,R0
    ADD b,R0
    MOV R0,t1
    MOV c,R1
    ADD d,R1
    MOV e,R0
    MUL R1,R0
    MOV t1,R1
    SUB R0,R1
    MOV R1,t4

Reordered:

    t2 := c + d
    t3 := e * t2
    t1 := a + b
    t4 := t1 - t3

    MOV c,R0
    ADD d,R0
    MOV e,R1
    MUL R0,R1
    MOV a,R0
    ADD b,R0
    SUB R1,R0
    MOV R0,t4

The reordered version needs no store and reload of t1, so it is two instructions shorter.
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Basic Blocks and Control Flow Graphs 

Basic Blocks 

A basic block is a sequence of consecutive instructions in which the flow
of control enters at the beginning and leaves at the end, without
halting or branching except at the end.

For example, the code

        MOV 1,R0
        MOV n,R1
        JMP L2
    L1: MUL 2,R0
        SUB 1,R1
    L2: JMPNZ R1,L1

is partitioned into three basic blocks:

        MOV 1,R0
        MOV n,R1
        JMP L2

    L1: MUL 2,R0
        SUB 1,R1

    L2: JMPNZ R1,L1



Flow Graphs 


A flow graph is a graphical depiction of a sequence of instructions with
control flow edges.

A flow graph can be defined at the intermediate code level or the target code
level.

The nodes of a flow graph are the basic blocks. A directed edge connects a block
to each block that can immediately follow it in execution.

    B1:      MOV 1,R0
             MOV n,R1
             JMP L2

    B2: L1:  MUL 2,R0
             SUB 1,R1

    B3: L2:  JMPNZ R1,L1

    Edges: B1 -> B3, B2 -> B3, B3 -> B2
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Basic Blocks Construction Algorithm 


Input: A sequence of three-address statements

Output: A list of basic blocks with each three-address statement
in exactly one block

1. Determine the set of leaders, the first statements of basic blocks

a. The first statement is a leader

b. Any statement that is the target of a conditional or unconditional goto is a leader

c. Any statement that immediately follows a conditional or unconditional goto is a leader

2. For each leader, its basic block consists of the leader and all
statements up to but not including the next leader or the end
of the program
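
The two steps above can be sketched in Python as follows. The statement syntax is an assumption made for this sketch: each statement is a string, an optional label is written as a prefix such as "L1:", and every jump contains the word goto followed by its target label.

```python
import re

def basic_blocks(stmts):
    """Partition a list of three-address statements into basic blocks."""
    # Map each label to the index of the statement that carries it.
    label_at = {}
    for i, s in enumerate(stmts):
        m = re.match(r"(\w+):(?!=)", s)      # "L1:" is a label, "a :=" is not
        if m:
            label_at[m.group(1)] = i

    leaders = {0} if stmts else set()        # rule (a): the first statement
    for i, s in enumerate(stmts):
        m = re.search(r"\bgoto\s+(\w+)", s)
        if m:
            leaders.add(label_at[m.group(1)])    # rule (b): target of a goto
            if i + 1 < len(stmts):
                leaders.add(i + 1)               # rule (c): statement after a goto

    # Step 2: each block runs from one leader up to the next leader.
    starts = sorted(leaders)
    return [stmts[a:b] for a, b in zip(starts, starts[1:] + [len(stmts)])]

# A three-address version of the loop used in the basic-block example.
code = [
    "r0 := 1",
    "r1 := n",
    "goto L2",
    "L1: r0 := 2 * r0",
    "r1 := r1 - 1",
    "L2: if r1 != 0 goto L1",
]
blocks = basic_blocks(code)
```

For this input the leaders are statements 1, 4 and 6, so blocks contains three lists, matching the three basic blocks shown earlier.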

Loops 

A loop is a collection of basic blocks, such that

- All blocks in the collection are strongly connected

- The collection has a unique entry, and the only way to reach a block in the loop
  is through the entry


















Equivalence of Basic Blocks 


Two basic blocks are (semantically) equivalent if they
compute the same set of expressions.

    b  := 0
    t1 := a + b                      a := c * a
    t2 := c * t1       =>            b := 0
    a  := t2

The blocks are equivalent, assuming t1 and t2 are dead: no longer used (no longer live).

Transformations on Basic Blocks 

A code-improving transformation is a code optimization to improve
speed or reduce code size.

- Global transformations are performed across basic blocks
- Local transformations are only performed on single basic blocks
- Transformations must be safe and preserve the meaning of the code
- A local transformation is safe if the transformed basic block is
  guaranteed to be equivalent to its original form

Some local transformations are:

- Common-Subexpression Elimination
- Dead Code Elimination
- Renaming Temporary Variables
- Interchange of Statements
- Algebraic Transformations
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Common-Subexpression Elimination 

Remove redundant computations.

    a := b + c              a := b + c
    b := a - d      =>      b := a - d
    c := b + c              c := b + c
    d := a - d              d := b

Look at the 2nd and 4th statements: they compute the same expression.

Look at the 1st and 3rd statements: b is redefined in the 2nd, so it has a different
value in the 3rd. They do not compute the same expression.

    t1 := b * c             t1 := b * c
    t2 := a - t1    =>      t2 := a - t1
    t3 := b * c             t4 := t2 + t1
    t4 := t2 + t3
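
A minimal Python sketch of local common-subexpression elimination is given below. It is a simplified illustration, not a full algorithm: statements are assumed to be (dst, op, left, right) tuples, a reused value is written as a "copy" statement, and commutativity (b + c versus c + b) is ignored.

```python
def eliminate_common_subexpressions(block):
    """Local CSE over a list of (dst, op, left, right) statements."""
    available = {}   # (op, left, right) -> name that currently holds the value
    result = []
    for dst, op, left, right in block:
        key = (op, left, right)
        if key in available:
            # Same expression with unchanged operands: reuse the earlier result.
            result.append((dst, "copy", available[key], None))
        else:
            result.append((dst, op, left, right))
        # Assigning dst kills every expression that reads dst or is held in dst.
        available = {k: v for k, v in available.items()
                     if dst not in k[1:] and v != dst}
        # The value stays reusable only if dst is not one of its own operands.
        if dst not in (left, right):
            available.setdefault(key, dst)
    return result

# First example from the text.
block = [
    ("a", "+", "b", "c"),
    ("b", "-", "a", "d"),
    ("c", "+", "b", "c"),
    ("d", "-", "a", "d"),
]
optimized = eliminate_common_subexpressions(block)
```

The fourth statement becomes the copy d := b, while the third statement is left alone because b was redefined by the second statement.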


Dead Code Elimination 


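Remove unused statements: an assignment to a name that is dead (not used afterwards) can be deleted.

    [Example partly lost in extraction. The surviving statement is  b := a + 1,
     with the note "Assuming a is dead (not used)".]

Remove unreachable code: a statement that control can never reach can be deleted.

    goto L2                 goto L2
    b := x + y      =>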
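
Removal of unused statements can be sketched in Python with a backward scan that tracks which names are live. Everything below is an assumption made for illustration: statements are (dst, operands) pairs, they have no side effects, and the second statement of the example (a := b + c) is invented here to give the scan something to delete.

```python
def eliminate_dead_code(block, live_on_exit):
    """Drop assignments whose target is dead; block holds (dst, operands) pairs."""
    live = set(live_on_exit)
    kept = []
    for dst, operands in reversed(block):
        if dst in live:
            kept.append((dst, operands))
            live.discard(dst)        # dst is defined here, so it is dead above
            live.update(operands)    # its operands are used, so they are live above
        # Otherwise dst is never read afterwards and the statement is dropped.
    kept.reverse()
    return kept

# b := a + 1 followed by a := b + c, where only b is live at the end of the block.
block = [("b", ("a",)), ("a", ("b", "c"))]
result = eliminate_dead_code(block, live_on_exit={"b"})
```

Because a is dead at the end of the block, the assignment to a is removed and only b := a + 1 remains.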
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Renaming Temporary Variables 


Temporary variables that are dead at the end of a block can be safely
renamed.

The basic block is transformed into an equivalent block in which each
statement that defines a temporary defines a new temporary. Such a
basic block is called a normal-form block or simple block.

    t1 := b + c             t1 := b + c
    t2 := a - t1    =>      t2 := a - t1
    t1 := t1 * d            t3 := t1 * d
    d  := t2 + t1           d  := t2 + t3

                            (normal-form block)

Interchange of Statements

Independent statements can be reordered without affecting
the value of the block, which lets the code generator choose a more efficient order.

    t1 := b + c             t1 := b + c
    t2 := a - t1    =>      t3 := t1 * d
    t3 := t1 * d            t2 := a - t1
    d  := t2 + t3           d  := t2 + t3

Note that normal-form blocks permit all statement interchanges
that are possible.









Algebraic Transformations 


Change arithmetic operations to transform blocks to algebraically
equivalent forms.

Simplify expressions or replace expensive operations by cheaper
ones.

    t1 := a - a             t1 := 0
    t2 := b + t1    =>      t2 := b
    t3 := t2 ** 2           t3 := t2 * t2

In statement 3, the ** operator usually requires a function call. Each statement
is transformed to a simpler, equivalent statement.

Next-Use Information

Next-use information is needed for dead-code elimination and register
assignment (if the name in a register is no longer needed, then the
register can be assigned to some other name).

If i: x := ... and j: y := x + z are two statements i and j, then the next use of x
at i is j.

Next-use is computed by a backward scan of a basic block, performing
the following actions on each statement

i: x := y op z

- Add the liveness/next-use info on x, y and z (whatever is currently in the
  symbol table) to statement i

- Before going up to the previous statement (scan up):

  • Set x info to "not live" and "no next use"

  • Set y and z info to "live" and the next uses of y and z to i

All non-temporary variables, and temporaries that are used across the block,
are considered live.






Computing Next-Use 

Example

Step 1

    i: a := b + c
    j: t := a + b    [ live(a) = true, live(b) = true, live(t) = true,
                       nextuse(a) = none, nextuse(b) = none, nextuse(t) = none ]

Attach the current live/next-use information to j.
Because the info is empty, assume the variables are live.
(Data flow analysis, Ch. 10, can provide accurate information.)

Step 2

    j: t := a + b    [ live(a) = true, live(b) = true, live(t) = true,
                       nextuse(a) = none, nextuse(b) = none, nextuse(t) = none ]

    Symbol table after j:
        live(a) = true      nextuse(a) = j
        live(b) = true      nextuse(b) = j
        live(t) = false     nextuse(t) = none

Compute the live/next-use information at j.

Step 3

    i: a := b + c    [ live(a) = true, live(b) = true, live(c) = false,
                       nextuse(a) = j, nextuse(b) = j, nextuse(c) = none ]
    j: t := a + b    [ live(a) = true, live(b) = true, live(t) = true,
                       nextuse(a) = none, nextuse(b) = none, nextuse(t) = none ]

Attach the current live/next-use information to i.

Step 4

    i: a := b + c    [ live(a) = true, live(b) = true, live(c) = false,
                       nextuse(a) = j, nextuse(b) = j, nextuse(c) = none ]
    j: t := a + b    [ live(a) = true, live(b) = true, live(t) = true,
                       nextuse(a) = none, nextuse(b) = none, nextuse(t) = none ]

    Symbol table after i:
        live(a) = false     nextuse(a) = none
        live(b) = true      nextuse(b) = i
        live(c) = true      nextuse(c) = i
        live(t) = false     nextuse(t) = none

Compute the live/next-use information at i.
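
The backward scan can be sketched in Python as below. The representation is an assumption made for this sketch: a statement x := y op z is stored as the triple (x, y, z), statements are identified by their list index, and a name that has no entry in the symbol table is treated as "not live, no next use".

```python
def next_use(block, live_on_exit):
    """Backward scan of a basic block given as (x, y, z) triples for x := y op z.

    Returns (info, table): info[i] maps the names of statement i to the
    (live, next_use) pair attached to it, and table is the symbol table
    as it stands after the whole block has been scanned.
    """
    table = {name: (True, None) for name in live_on_exit}
    info = [None] * len(block)
    for i in range(len(block) - 1, -1, -1):
        x, y, z = block[i]
        # Attach whatever the symbol table currently says about x, y and z.
        info[i] = {n: table.get(n, (False, None)) for n in (x, y, z)}
        # Then update the table before moving up to the previous statement.
        table[x] = (False, None)     # x is redefined here
        table[y] = (True, i)         # y and z are used here
        table[z] = (True, i)
    return info, table

# i: a := b + c   (index 0),   j: t := a + b   (index 1)
block = [("a", "b", "c"), ("t", "a", "b")]
info, table = next_use(block, live_on_exit={"a", "b", "t"})
```

With index 0 standing for i and index 1 for j, info[0] reproduces the information attached to i in Step 3, and table reproduces the symbol table of Step 4. The entry for x is set before those for y and z, so a statement such as x := x + 1 correctly leaves x live.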
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Code Generator 


• Generates target code for a sequence of three-address
  statements using next-use information

• Uses a new function getreg to assign registers to variables

• Computed results are kept in registers as long as possible,
  which means:

  - Result is needed in another computation

  - Register is kept up to a procedure call or end of block

• Checks if operands to three-address code are available in
  registers

Code Generation Algorithm 

For each statement x := y op z

1. Set location L = getreg(y, z)    // to store the result of y op z

2. If y is not already in L, then generate

       MOV y',L                     // to place a copy of y in L

   where y' denotes one of the locations where the value of y is
   available (choose a register if possible)

3. Generate the instruction

       OP z',L

   where z' is one of the locations of z;
   update the register/address descriptor of x to include L

4. If y and/or z has no next use and is stored in a register, update
   the register descriptors to remove y and/or z


Register and Address Descriptors 


A register descriptor keeps track of what is currently stored
in a register at a particular point in the code, e.g. a local
variable, argument, global variable, etc.

    MOV a,R0        "R0 contains a"

An address descriptor keeps track of the location where the
current value of a name can be found at run time, e.g. a
register, stack location, memory address, etc.

    MOV a,R0
    MOV R0,R1       "a in R0 and R1"

The getreg Algorithm 


To compute getreg(y, z)

1. If y is stored in a register R, R holds only the value of y, and y
   has no next use, then return R;
   update the address descriptor: the value of y is no longer in R

2. Else, return a new empty register if available

3. Else, find an occupied register R;
   store its contents (register spill) by generating

       MOV R,M

   for every M in the address descriptor of y;
   return register R

4. If x is not used in the block, or no suitable register can be found,
   return a memory location
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Code Generation 

Example 

Statement: d := (a - b) + (a - c) + (a - c)

Statements     Code Generated     Register Descriptor     Address Descriptor
                                  Registers empty
t := a - b     MOV a,R0           R0 contains t           t in R0
               SUB b,R0
u := a - c     MOV a,R1           R0 contains t           t in R0
               SUB c,R1           R1 contains u           u in R1
v := t + u     ADD R1,R0          R0 contains v           u in R1
                                  R1 contains u           v in R0
d := v + u     ADD R1,R0          R0 contains d           d in R0
               MOV R0,d                                   d in R0 and memory
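
The example can be reproduced with a small Python sketch of the code generation algorithm. It is deliberately simplified and several points are assumptions made for illustration: statements are (x, op, y, z) tuples with op already given as a mnemonic, "no next use" is approximated by liveness after the statement, only a register descriptor is kept, variables not found in a register are assumed to be in memory under their own name, and register spilling (step 3 of getreg) is left out.

```python
def generate(block, live_on_exit, registers=("R0", "R1")):
    """Generate two-address code for a list of (x, op, y, z) statements."""
    # Names that are live after each statement, found by a backward scan.
    live, live_after = set(live_on_exit), []
    for x, _, y, z in reversed(block):
        live_after.append(set(live))
        live.discard(x)
        live.update((y, z))
    live_after.reverse()

    reg = {r: set() for r in registers}   # register descriptor
    code = []

    def where(name):
        """A register holding `name` if there is one, else its memory name."""
        return next((r for r in reg if name in reg[r]), name)

    for (x, op, y, z), live_out in zip(block, live_after):
        # getreg: reuse y's register if y is alone there and has no next use,
        # otherwise take an empty register and copy y into it.
        loc = where(y)
        if not (loc in reg and reg[loc] == {y} and y not in live_out):
            loc = next((r for r in reg if not reg[r]), None)
            if loc is None:
                raise RuntimeError("no free register; spilling is not sketched")
            code.append(f"MOV {where(y)},{loc}")
        code.append(f"{op} {where(z)},{loc}")
        for names in reg.values():
            names.discard(x)              # any older copy of x is now stale
        reg[loc] = {x}
        for names in reg.values():        # drop operands with no next use
            names.difference_update(n for n in (y, z) if n not in live_out)

    # At the end of the block, store live names that exist only in a register.
    for name in sorted(live_on_exit):
        if where(name) in reg:
            code.append(f"MOV {where(name)},{name}")
    return code

block = [("t", "SUB", "a", "b"), ("u", "SUB", "a", "c"),
         ("v", "ADD", "t", "u"), ("d", "ADD", "v", "u")]
code = generate(block, live_on_exit={"d"})
```

For this block the sketch emits the same seven instructions as the Code Generated column of the table above.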


Peephole Optimization 

Statement-by-statement code generation often produces redundant
instructions that can be optimized to save time and space in
the target program.

Peephole optimization examines a short sequence of target instructions in a window
(the peephole) and replaces the instructions by a faster and/or shorter
sequence whenever possible.

It can be applied to intermediate code or target code.
Typical optimizations:

- Redundant instruction elimination

- Flow-of-control optimizations

- Algebraic simplifications

- Use of machine idioms
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Eliminating Redundant Loads and Stores 


    MOV R0,a
    MOV a,R0

(This type of code is not generated by the code generation algorithm given earlier.)

The second instruction can be deleted because the first ensures that the
value of a is in R0, but only if the second instruction is not labeled with a target
label.

- A peephole represents a sequence of instructions with at most one
  entry point

The first instruction can also be deleted if live(a) = false.

Deleting Unreachable Code 

An unlabeled instruction immediately following an
unconditional jump can be removed.
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Branch Chaining 

Shorten chains of branches by modifying target labels.

Remove redundant jumps as well:

        goto L1                         if a < b goto L2
        ...                    =>       goto L3
    L1: if a < b goto L2                ...
        goto L3


Other Peephole Optimizations 

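Reduction in strength: replace expensive arithmetic operations with
cheaper ones.

    a := x ^ 2              a := x * x
    b := y / 8      =>      b := y >> 3

(The shift is equivalent to the division only when y is an unsigned or non-negative integer.)

Utilize machine idioms (such as an increment instruction):

    a := a + 1      =>      inc a

Algebraic simplifications: statements such as the following have no effect and can be removed.

    a := a + 0
    b := b * 1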
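
Three of the peephole optimizations described in this section can be sketched as a single pass in Python. The instruction encoding is an assumption made for this sketch: each instruction is a (label, op, src, dst) tuple, label is None for an unlabeled instruction, and an unconditional jump is written with op "JMP" and its target in the src field.

```python
def peephole(code):
    """One pass over (label, op, src, dst) tuples; returns the improved list."""
    out = []
    for ins in code:
        label, op, src, dst = ins
        prev = out[-1] if out else None
        if label is None and prev is not None:
            # Redundant load: MOV R,a directly followed by MOV a,R.
            if op == "MOV" and prev[1] == "MOV" and (src, dst) == (prev[3], prev[2]):
                continue
            # Unreachable code: unlabeled instruction after an unconditional jump.
            if prev[1] == "JMP":
                continue
        # Algebraic simplification: adding 0 or multiplying by 1 has no effect.
        if label is None and (op, src) in {("ADD", "#0"), ("MUL", "#1")}:
            continue
        out.append(ins)
    return out

code = [
    (None, "MOV", "R0", "a"),
    (None, "MOV", "a", "R0"),    # redundant load
    (None, "ADD", "#0", "R0"),   # no effect
    (None, "JMP", "L2", None),
    (None, "MUL", "#2", "R0"),   # unreachable
    ("L2", "MOV", "R0", "b"),
]
improved = peephole(code)
```

Labeled instructions are never deleted, because a label may be the target of a jump from elsewhere. In this example three of the six instructions are removed.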
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Run Time Storage management 


A compiler obtains a block of storage from the operating system for the compiled program to
run in. This run-time storage might be subdivided to hold

1. the generated target code,

2. data objects, and

3. a counterpart of the control stack to keep track of procedure activations.

• The size of the generated target code is fixed at compile time, so it can be placed in a statically
determined area at the low end of memory.

• The sizes of some data objects may also be known at compile time, so these too can be placed in a
statically determined area.

• The addresses of these data objects can be compiled into the target code.

• When a procedure call occurs, execution of the current activation is interrupted
and information about the status of the machine, such as the value of the program counter and the machine
registers, is saved on the stack until control returns from the call to the activation.

• Data objects whose lifetimes are contained in that of an activation can be allocated on the
stack along with other information associated with the activation.

• A separate area of run-time storage, called the heap, holds other information.

The run-time storage is subdivided as follows:

    Code
    Static Data
    Stack
    Heap

• The sizes of the stack and the heap may change during execution.

• By convention, the stack grows down and the heap grows up.

Information needed by a single execution of a procedure is managed using a contiguous block of
storage called an activation record. An activation record is a collection of fields:

Field                    Purpose
returned value           Value returned after execution
actual parameters        Used by the calling procedure to supply parameters to the called procedure
optional control link    Points to the activation record of the caller
optional access link     Refers to non-local data held in other activation records
saved machine state      State of the machine just before the procedure call
local data               Data that are local to an execution
temporaries              Temporary values used for evaluation of expressions


Since run-time allocation and de-allocation of activation records occur as part of procedure
call and return sequences, the following three-address statements are in focus:

1. call

2. return

3. halt

4. action - a placeholder for other statements
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Now , consider the following input to the code generator. 



Using static allocation:

A call statement in the intermediate code is implemented by two target machine instructions,
MOV and GOTO.

• The code for procedures c and p is placed at the arbitrary addresses 100 and 200.

• Assume each ACTION takes 20 bytes, and that the MOV and GOTO instructions together with their
3 constants also take 20 bytes.

The target code for the input above is then:

/* Code for c */

100: ACTION1

120: MOV #140,364 /* saves return address 140 */

132: GOTO 200 /* calls p */

140: ACTION2

160: HLT


/* Code for p */

200: ACTION3

220: GOTO *364 /* returns to address saved in location 364 */


/* 300-363 hold the activation record for c */

300: /* return address */

304: /* local data for c */


/* 364-451 hold the activation record for p */

364: /* return address */

368: /* local data for p */

• The MOV instruction at address 120 saves the return address 140 in the machine status field,
which is the first word in the activation record of p.

• The GOTO instruction at 132 transfers control to the first instruction of the target code of the
called procedure.

• When the GOTO statement at address 220 is executed, *364 represents 140, so control
returns to 140.
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